
When Disaster Strikes…


So, everyone would agree that a helping hand is nice to have now and then. Like the time I thought it would be a good idea to skateboard while holding onto my brother’s car as he drove down the street. It was his helping hand reaching down to pick me up off the road (bleeding) and sitting me in the car that I won’t forget (I still have that scar on my hip). It was brother helping brother – an understanding that when one is down, the other will help get him on his feet (hopefully before mom sees so that we could get our story coordinated as to how it happened). In the UCS world, the brothers in this scenario are the Fabric Interconnects (I’m not sure who the mother is).

There are times when a Fabric Interconnect might encounter a software failure – for whatever reason – and land at the “loader” prompt. It’s rare, but it can happen. The loader prompt is a lonely place and it’s not pleasant. The good news is that if you still have a single FI working, you can use it to resurrect the broken FI. First off, if you ever find yourself staring at the loader prompt, stop cursing and just try to unplug it and plug it back in. Don’t worry if “dir” shows no files – just try it. I’ve seen it work and the FI comes right back on the next boot. If that doesn’t work, you have some work to do…

The loader is just that – a loader. It “loads” an OS – like a bootstrap. You need 3 files to permanently get out of the loader – kickstart, system, and the UCSM image. Luckily all of these live on your remaining FI. The bad news is you can’t get to them without bringing it down as well. So, if you’re in production and can’t afford to bring down the entire UCS pod, you should stop reading and call TAC. They can get you the 3 magic files you need and can get it all running without bringing anything additional offline. But if you’re in a situation where you can afford to take down the remaining FI, you can fix this problem yourself.

To make this work, you will need:

  • Non-functional FI
  • Functional FI
  • FTP Server
  • TFTP Server

Your basic recovery will include:

  1. Disconnect the L1/L2 cables between the FI’s to avoid messing up the cluster data they share
  2. Boot FI-A to loader
  3. Force FI-B to loader
  4. Boot kickstart on FI-B
  5. Assign IP address to FI-B
  6. FTP the kickstart, system, and ucsm images from FI-B to an FTP server
  7. Reboot FI-B back to its normal state
  8. Get kickstart image onto TFTP server (unless FTP/TFTP are the same server)
  9. Boot kickstart image on FI-A via TFTP server
  10. FTP kickstart, system, and ucsm images down to FI-A
  11. Copy ucsm image file to the root
  12. Load system image on FI-A
  13. “Activate” the firmware on FI-A
  14. Connect L1/L2 cables back and rejoin the cluster

Reboot the “good” FI (known in this document now as FI-B), and begin pressing CTRL+R to interrupt the boot process. You will find FI-B now stops at the loader prompt too. Now type

boot /installables/switch/ <tab>

which will show you all files in this folder. You are looking for the obvious kickstart file and you want the latest one. To make the display easier to read, I would type this:

boot /installables/switch/ucs-6100-k9-kickstart <tab>

Backspace is not an option so if you make an error, use the arrow keys and the “delete” key to fix typos.

Select the latest image, hit enter, and FI-B now begins to boot the kickstart image. Give it a few minutes and you should find it stops at the “boot” prompt. This prompt is not as lonely as the loader prompt, but it’s still not a fun place to be (at least you can backspace now). You actually have much more functionality than you did with loader, but you won’t need it for this exercise. At this point you need to assign an IP address to FI-B so that you can FTP the kickstart image to an FTP server. The commands will look like this:

#config t

#int mgmt 0

#ip address X.X.X.X <mask>

#no shut

Wait 10-15 seconds

#<ctrl+z to return the shell to the top level>

# copy bootflash:installables/switch/ucs-6100-k9-kickstart <tab>

Select the latest version and copy it to the FTP server.

DO NOT USE THE FILES IN THE ROOT OF BOOTFLASH AT ANY TIME DURING THIS PROCESS. Nothing catastrophic will happen, but the FI will not boot in the end.

The shell will prompt you for the FTP server address and credentials.
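It goes something like this – the server address and username below are placeholders, the filename will be whichever kickstart you selected above, and the exact prompt wording varies a bit between releases:

# copy bootflash:installables/switch/ucs-6100-k9-kickstart.5.0.3.N2.2.10.418.bin ftp:
Enter hostname for the ftp server: 10.0.0.50
Enter username: ftpuser
Password: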

You need to allow about 10-15 additional seconds after you “no shut” the interface for the IP to become active and usable.

You now need to copy the system and UCSM images as well, since you will need them soon enough. The other two files will look something like:

installables/switch/ucs-manager-k9.2.1.0.418.bin

installables/switch/ucs-6100-k9-system.5.0.3.N2.2.10.418.bin

again – your versions will be different

Once you are returned to the boot prompt, and all 3 files are copied, the kickstart file is on the FTP server. You should boot FI-B back into production. You now need to get the kickstart file available via TFTP using whatever process you do to make that happen. One word of caution here – TFTP blows. It runs on UDP, it’s slow, and it has no error checking. If your first attempt at booting fails, try a different TFTP server program (trust me on this – I had bruises on my head from banging it on the wall). Once the file is available via TFTP, return to FI-A which is at the loader prompt. You will now boot that kickstart image via TFTP using these commands:
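The loader commands go roughly like this – the addresses are placeholders (FI-A’s temporary IP, its gateway, and your TFTP server), the filename is whichever kickstart you copied, and the loader syntax can vary a little between versions:

loader> set ip 10.0.0.11 255.255.255.0
loader> set gw 10.0.0.1
loader> boot tftp://10.0.0.50/ucs-6100-k9-kickstart.5.0.3.N2.2.10.418.bin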

Incidentally, you cannot ping this address from an outside station – just FYI.

Then it begins loading the image. It should take just a few seconds to actually start booting. FI-A will now land at the boot prompt like FI-B did earlier. You need to rebuild the filesystem on FI-A, so type:

#init system

This will take a few minutes. When it’s done, you can now use FTP to copy the 3 files down to FI-A. Use this command to retrieve each file:

# copy ftp: bootflash:

The shell will prompt you for everything it needs to copy the files.
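If you’d rather not walk the prompts, the full-URL form is worth trying (it may or may not be supported in this recovery shell – the server address and credentials are placeholders, and it will still ask for the password), one copy per file:

# copy ftp://ftpuser@10.0.0.50/ucs-6100-k9-kickstart.5.0.3.N2.2.10.418.bin bootflash:
# copy ftp://ftpuser@10.0.0.50/ucs-6100-k9-system.5.0.3.N2.2.10.418.bin bootflash:
# copy ftp://ftpuser@10.0.0.50/ucs-manager-k9.2.1.0.418.bin bootflash: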

After all 3 are copied, there is one very important command that needs to be run. It won’t make sense, but you must do it:

# copy bootflash:/ucs-manager-k9.2.1.0.418.bin bootflash:/nuova-sim-mgmt-nsg.0.1.0.001.bin

nuova-sim-mgmt-nsg.0.1.0.001.bin is the exact name that is needed here.

Now that you have all 3 files local on the FI, you would be able to recover much quicker if the FI were to lose power. At this moment, if that happened, you would be returned to the loader prompt, but you would be able to boot via bootflash instead of TFTP. Anyway, you are now at the boot prompt and need to finish booting. Type load bootflash://ucs-6100-k9-system.5.0.3.N2.2.10.418.bin. This will start loading the system image, and when it’s done loading, it will look for the UCSM image that you also copied and the FI should come up.

It will walk you through the setup menu, and since the L1/L2 cables are not connected, I would go ahead and set it up as standalone – we will join it to the cluster soon.

Once you are logged into the FI, you need to activate the current firmware to set the startup variables. The easiest way to do this is in the GUI. Just go into Firmware Management, select “Activate Firmware”, and select the FI. You will likely see that no version is in the startup column. Regardless, you need to activate the version that is already there. If it doesn’t let you, exit Firmware Management and navigate to the Fabric Interconnect on the left-side Tree menu and activate the firmware from there using the “force” option. This will also fix up the ucsm image file that we copied to the root (it turns it into a symbolic link).

That’s about it. You should be OK to erase the config on FI-A (#connect local-mgmt), hook up the L1/L2 cables, and rejoin the cluster on the next reboot. I really hope you never need to use this… I mainly wrote this blog for myself because in the lab we do a lot of crazy stuff and I often forget a step here and there. So I wanted it all written down to refer back to, and I’ve wanted to get this one done for quite some time.

Thanks for reading…

-Jeff


Resetting UCS to Factory Defaults


So, way back in early 2009, Sean McGee and I decided to work over the weekend in San Jose to get more stick time with “Project California”, as UCS was called then. We borrowed a system from someone, backed it up, and started discovering how UCS worked. We had no help locally since it was a weekend, and one thing I wanted to know was how to erase the configuration and start over. We were still months away from documentation and the online help inside the pre-1.0 UCSM was very incomplete. We eventually did figure out how to erase the configuration and start over, but we had to stumble upon it. Resetting UCSM is a well-documented process now, but I thought I’d write this post to cut through the pre-requisites, the reminders about proper backups, etc., and just give you the commands to get the job done. You’re on your own to make sure you really want to do this.

In another blog post I covered how to restore a failed Fabric Interconnect (FI). It gives you some insight into a complete and total rebuild of an FI in a worst-case scenario. While that would accomplish the “factory defaults” you desire, it’s a painful way to get there. Thankfully, the “erase” process is pretty easy. There is no way to do this in the GUI so grab your favorite ssh client and connect to either FI. Once connected, type the following:

FI-A # connect local-mgmt

FI-A (local-mgmt)# erase config

That’s it! You’ll need to confirm the command before it executes, but that will start the process. You then need to repeat the process on the other FI. There is a way you could erase both of them by connecting to the VIP and not directly to the FI, but I’m going to cover that feature in another post because it’s pretty cool all by itself.

-Jeff

UCS Command Line Shells


So, about 2 years ago I was with a customer who had opted to purchase UCS over their incumbent HP hardware for their private cloud build. As a first step, we upgraded the firmware on the UCS system. What I did not know at the time was that the mgmt0 cable plugged into the “B” Fabric Interconnect (FI) was showing link, but was not on the right vlan (or wasn’t passing traffic). When it came time in the upgrade to failover the management instance of UCSM to the “B side”, we lost access completely to UCS manager. This and other seemingly related events (but were actually totally unrelated in hindsight) led me to believe that UCSM had failed in some manner and started me down a multi-hour troubleshooting session that I really wished had never happened. I opened an enhancement request to allow UCSM to detect this situation in the future and move UCSM back to the originating FI if it is unable to find the default gateway. Had I known this trick that I am about to tell you concerning the UCS shells, I might have been smart enough to get out of my situation much faster. The sad thing is I actually did know this – it was just knowledge from so early on in my UCS learning curve that I didn’t fully absorb the importance of it. So, now is your chance to start absorbing…

If you have spent any time around UCS (and if you are reading this, you probably have), you know that there is a command-line interface in addition to the provided GUI. The actual “UCS” command line is the starting point “shell” that you are automatically in when you ssh to the UCSM Virtual IP (VIP). We’ll refer to this as the root shell for the purposes of this document. Although root is the main shell, there are many sub-shells available to you in UCS that accomplish various tasks. This post will focus on accessing two specific sub-shells, local-mgmt and NXOS. This article assumes you have knowledge of what each of these shells is for and will not discuss the details of these sub-shells, but will give you an understanding of how to navigate the root shell to gain access to these other sub-shells.

It helps if you think of the shells in a hierarchical manner: the UCSM root shell at the top, with the local-mgmt and NXOS sub-shells of each fabric beneath it. As I mentioned, there are additional sub-shells beyond these, but NXOS and local-mgmt are by far the most used, and they are unique in how you can access them. Because the root shell sits above the sub-shells of both fabrics, it allows you to access either sub-shell of either fabric (assuming you are connected to the UCSM VIP and not an individual FI). For instance:
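(The session below is a rough reconstruction – “UCS” is a placeholder system name, and your prompts will show your own system name.)

UCS-B# connect local-mgmt a
UCS-A(local-mgmt)# exit
UCS-B# connect nxos a
UCS-A(nxos)# exit
UCS-B#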

Notice that I started out on Fabric B because that was the controlling instance (FI) of UCSM (you can flip the controlling instance back and forth without data plane disruption – a post for another day). While on Fabric B, I typed connect local-mgmt A. The UCSM root shell then connected me to the local-mgmt sub-shell on fabric A. Had I typed just connect local-mgmt (omitting the “A”), it would default to the fabric that the VIP is currently on (in this case, B). From the root shell, you can do the same type of connection to the NXOS sub-shell on either fabric as well. You cannot jump from a sub-shell to any other sub-shell. You must “exit” back to the root shell to enter any sub-shell.

Back to my bad day story… had I remembered this trick, how would I have avoided the issue? Well, I could always access the A Fabric Interconnect. From there, I could have run connect local-mgmt B, accessed UCSM (which was running just fine on Fabric Interconnect B), and flipped UCSM back to Fabric Interconnect A using local-mgmt commands. The success in doing that would have instantly led me to the mgmt0 connection on the B fabric. Things like this are much easier to spot the second time around though – and I saw it again at a customer in production who had a faulty connection to FI-B. In that instance, fixing it was really easy (and they thought I was really smart – no, I didn’t tell them the truth).
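For reference, a minimal sketch of what that recovery looks like from the A side (cluster lead a tells the cluster to make A the primary):

UCS-A# connect local-mgmt b
UCS-B(local-mgmt)# show cluster state
UCS-B(local-mgmt)# cluster lead a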

That’s pretty much all there is to it. If you want to play around with the various other shells, you can type connect ? at the root shell and it will return all the possible devices you can connect to.
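If you do go exploring, here are a few of the targets you’ll see (the numeric arguments are chassis/slot-style identifiers, so adjust for your hardware, and the exact list depends on your UCSM version):

UCS-B# connect adapter 1/1/1
UCS-B# connect cimc 1/1
UCS-B# connect iom 1
UCS-B# connect local-mgmt a
UCS-B# connect nxos b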

 

P.S. Ironically, the same day I wrote this article, I got a call from a co-worker who “could not connect back to UCSM after the primary FI rebooted during a firmware upgrade”. We used this trick (which he thought was way cool) and then discovered later that he had a flaky Ethernet cable in mgmt0 in the (formerly) subordinate FI. If you’re curious about why the enhancement I referenced above didn’t help here, it’s because the enhancement (mgmt0 interface monitoring) is enabled by default on all NEW installations but left at the previous setting on any UPGRADES (because change is a bad thing). I believe that change went into the 2.0 release.

 

Thanks for your time.

 

-Jeff

 

ENM Source Pinning Failed – A Lesson in Disjoint Layer 2


So, today’s article is on VLAN separation, the problems it solves, and the problems it sometimes creates. Not all networks are cut from the same cloth. Some are simple and some are complex. Some are physical and some are virtual. Some are clean while others are quite messy. The one thing that they all have in common is that Cisco UCS works with all of them.

A Look Back

In UCS, we have a topology that we call “disjoint Layer 2” (DJL2), which simply means that there are networks upstream from UCS that are separated from one another and cannot all be accessed by the same UCS uplink port (or port channel). For instance, you might have upstream VLANs 10, 20, and 30 on UCS uplink port 1 and VLANs 40, 50, and 60 on UCS uplink port 2. Prior to UCS 2.0, this configuration was not supported (in End Host Mode (EHM)). The main reason is that prior to 2.0, when VLANs were created, they were instantly available on ALL defined uplink ports and you could not assign certain VLANs to certain uplink ports. In addition to this, UCS uses the concept of a “designated receiver” (DR) port, which is the single port (or port channel) chosen by UCSM to receive all multicast and broadcast traffic for all VLANs defined on the Fabric Interconnect (FI). To make this clear, UCS receives all multicast/broadcast traffic on this port only and drops broadcast/multicast traffic received on all other ports. Unless you have DJL2, this method works really well.

If you do have DJL2, this would lead to a problem if you defined the above VLAN configuration and plugged it into pre-2.0 UCS (in EHM). In this situation, UCS would choose a designated receiver port for ALL VLANs (10-60) and assign it to one of the available uplinks. Let’s say the system chose port 1 (VLANs 10, 20, and 30) for the DR. In that situation, those networks (10, 20, 30) would work correctly, but VLANs 40, 50, and 60 (plugged into port 2) would not receive any broadcast or multicast traffic at all. The FI will learn the MAC addresses of the destinations on port 2 for 40, 50, and 60, but necessary protocols like ARP, PXE, and DHCP (just to name a few) would be broken for these networks. In case you’re wondering, pin groups do not solve this problem, so don’t waste your time.

Instead, you need UCS 2.0+ and DJL2, which allows specific VLANs to be pinned to specific uplink ports. In addition, you now have a DR port for each defined VLAN as opposed to one globally for each FI. If you want to know more about the DR port, how it works, and how you can see which ports are the current DR on your own domain, please see the Cisco whitepaper entitled “Deploy Layer 2 Disjoint Networks Upstream in End Host Mode” located here: http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns944/white_paper_c11-692008.html

The Rules

You’ve probably figured out that if this were super easy, I wouldn’t be writing about it. Well, yes and no. It’s easy to turn on the DJL2 feature, but there are some lesser-known rules around making it work. There is no “enable DJL2” button and you won’t find it by that name in UCSM. You simply enable it when you assign specific VLANs to specific uplink ports. It’s then automatically on. But many people make a mistake here. Staying with the above example, you want port 1 to carry VLANs 10-30 and port 2 to carry VLANs 40-60. When you first enter VLAN manager, you will see VLANs 10-60 defined and carried on ports 1 and 2. You might think to just take VLANs 40-60 and assign them to port 2. Well, that does remove 40-60 from port 1, but it would also leave 10-30 on port 2 (along with 40-60). So you must isolate VLANs to their respective uplink ports. Furthermore, if you define a new VLAN, you need to go into VLAN manager and pin it to the port(s) you intend and remove it from the ports it should not be on. The main thing to remember here is that the original UCS rule on VLAN creation has not changed. That is, a created VLAN is always available on all uplink ports. That still happens even when you have DJL2 set up, because UCS Manager has no idea where to put that new VLAN unless you tell it – so it follows the original rule. I recommend looking at your VLAN config in NXOS (show vlan) before you run it in production. This will verify that the changes you wanted to make are truly the changes you made in the GUI.
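For example, sticking with the VLAN numbers above, a quick spot-check from the NXOS shell might look like this (VLAN 40 is just the example, and the second command is the same one I use in the update at the bottom of this post):

FI-A# connect nxos
FI-A(nxos)# show vlan id 40
FI-A(nxos)# show platform software enm internal info vlandb id 40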

ENM Source Pinning Failed

So now we have DJL2 set up properly on our uplinks. Let’s look at the server side, as it is often an area of confusion. It’s probably also the way most of you found this blog entry, because you googled the term “ENM Source Pinning Failed”. Let me explain why. When you create vNICs on a service profile using the config we had above (10-30 on port 1 and 40-60 on port 2), you are not able to trunk/tag VLANs from BOTH port 1 and port 2 on the same vNIC. For example, you can have a single vNIC with VLANs 10, 20, and 30 and another vNIC with VLANs 40, 50, and 60, and both vNICs can be on the same server. But you CANNOT have a single vNIC with VLANs 10 and 40. If you do, the vNIC will go into an error state and will lose link until one of the VLANs is removed. The picture below might help – keep in mind that this diagram is very simplistic and that you can also get an ENM source pin failure with just a single FI:

The above illustration shows a configuration where the user wants to have VLANs 10-50 reach a single server; this will not work in a DJL2 configuration and will result in an ENM Source Pin Failure. Instead, the illustration below would achieve the desired result of VLANs 10-50 reaching the same server, but does not violate the DJL2 rules and would work fine.

Hopefully this helped explain DJL2 a little better and maybe alleviate the ENM Source Pinning error you might be getting.

Thanks for stopping by.

-Jeff

Update: I am running UCSM version 2.2.1d in the lab at present and came across a scenario I need to share. I had vlan 34 on FI-A and vlan 35 on FI-B. I did not need failover for this so each vlan was isolated to a single FI. I set up disjoint L2 correctly and “show vlan” in NXOS mode showed me that it was indeed setup the way I intended. However, any profiles that used vlan-34 on FI-A would throw the dreaded ENM source pin failed error in the profile. I spent several hours verifying and double-checking everything, but no joy. I then ran this command:
FI-A (nxos)# show platform software enm internal info vlandb id 34

I got nothing back. Nada, zilch, nothing.
Running this on FI-B, I got what I expected:
FI-B (nxos)# show platform software enm internal info vlandb id 35
vlan_id 35
————-
Designated receiver: Eth1/1
Membership:
Eth1/1

Assuming something was really confused here, I rebooted FI-A and all was well. If you encounter this, I don’t suggest you reboot the FI (unless you’re like me and it’s in a lab) – I would call TAC instead and let them find a non-disruptive method of correcting the issue. I just want to make a note of it here in case you feel like you got it all right and it still has an issue.

2nd update:

You can run into this problem from the opposite direction if you implement the uplinks first and then create the profiles later. If that’s the case, the profiles will throw an error before being applied saying “not enough resources overall” and “Failed to find any operation uplink port that carries all vlans of the vNICs”.

Change Management with Change Tracking, Version Control… and Rollbacks



So… what we’re going to discuss here is a method by which you can implement some mechanism for change management, version control and rollback ability in UCS through service profile templates, but first, I’d like to give a little background.

My name is Loy Evans, and I’m a Cisco Data Center Consulting Systems Engineer, like Jeff. In my past, I’ve held a number of varied jobs in the IT industry, from Programmer to Router Jockey to Data Center Master Architect. For the past few years, I’ve been consulting on UCS for customers in the Southeastern US. I pride myself on understanding not just what customers ask for, but also the questions behind the question being asked. This typically leads me to one of two things: either a business need or a technical issue. OK, mostly some of both, but there’s always a tendency in one direction or the other, and in my opinion, it’s very important to understand the root of the question, as there will likely be subtle differences in how you approach the answer and maybe even more subtle differences in how you present the solution. When I talked to Jeff about some stuff I was doing, he thought it would be “in the wheelhouse” of what he considered core content for his blog. “So”… here we go.

Back to the lesson at hand… In this case, the customer brought up an issue that had recently happened: a few changes had taken place (a change in BIOS configuration, a Firmware update, and some operational network setting changes), and because of the way in which they had implemented them, they had no idea when the changes were done or how to track the impact to the service profiles. These changes were made by modifying policies that were already being referenced by the service profiles, making change management difficult, if not impossible, and the ability to monitor the magnitude and rate of change non-existent. On top of that, they had no process for implementing the changes in an orderly fashion. In short, they had a great tool in UCS Manager, but were not using it for efficient operational control.

I decided to step back and look at the problem from a little higher viewpoint. My take on it was first: WHAT problem are you trying to solve, then HOW are you solving it? The answer to the former was simple: we have to adjust the environment to keep up with addressing a business need (adding/removing VLANs to a cluster) and fixing a technical issue (a Firmware upgrade to support a new feature or a BIOS configuration change to address a hypervisor bug). The answer to the latter was not so simple. In this case, they had not really worked out a system, and the implementation of the fixes followed bad form: modifying the configuration of a policy already in place. I’d say that’s probably a worst practice. I guess there is a bit of a gotcha here: while UCS Manager is very flexible and you can just edit a policy at will, that doesn’t mean you should. The good news is you have options; the bad news is… you have options.


So, my suggestion was to begin a practice of version control based on Policies and Templates.  The following is a description of a set of concepts and practices that we put into place, and I now use as a recommended practice to all customers as they look to operationalize UCS in their organizations.  For this discussion, I’m going to use Firmware Management for UCS Blades as the change we are implementing.

Keep this in mind: this is not the only change that you can manage through this process, it can extend to almost any change you might want to put in place on UCS.

Instead of Modifying, Try Creating and Adding a Date/Time Stamp

In this example we are going to create a new Firmware Management Policy (the previous version was 2.1.1f, the new version is 2.1.2a). To keep with the date stamp theme, we create a firmware management policy with a name of 20130901_hv_fw, which references the blade firmware package of version 2.1.2a, as shown below.

For the example documented here, we have previously created one (named 20130801_hv_fw), and we created a new one as mentioned above.  I will reference these for the rest of this post.

Most would typically just go change the service profile or updating template and move on. However, that would only exert control at one level, not at the root level for the workload, which is where we would find the most useful benefits of configuration management – we would gain low-level control but not maintain high-level control. So let’s not stop there with version control.

Templates Can Be Your Friend

Now we will take a service profile that is currently impacted by the business or technical issue, right click and create an updating service profile template.

Side note: in this and in all similar click actions, you can right click in the navigation pane on the left side, or you can use one of the context links in the content pane on the right-hand side of UCS Manager.

In our example I’ll use a service profile named hv_0 as our primer, which is a service profile created for a Hyper-V workload. This primer is the workload that we used to test the configuration; once it is tested and verified, we can use it as the model for the rest of the Service Profiles. We can make experimental changes, including the firmware policy, to this Service Profile in our test environment, test it out, then use it as a reference. You can see here that we have used the Firmware Policy labeled 20130801_hv_fw.

Once we have done this, it’s very easy to create replicas.  First we create a Service Profile Template by right clicking and selecting “Create Service Profile Template”.


Which we will configure as an Updating Template, functionality that we will use later.


This action takes only a few seconds, and once we have that Template, we can right click it to create the directly associated Service Profiles.  In this example we will create 3 more Hyper-V host workloads, all with identical configurations, BIOS configurations, Firmware Versions, etc. as shown below, using the same naming convention we employed on the first (hv_0).


Now that we have created these new Service Profiles, you will notice something different from the original, as shown below.  These service profiles are not directly modifiable, but rather are bound to the Template and must be either unbound or configured indirectly through the template.


If we look at hv_0, however, we will see that this is not the case: that Service Profile is directly modifiable, as it’s not bound to a template. To maintain consistency, we can bind it to the Template we created by right clicking the Service Profile hv_0, clicking “Bind to a Template”, and then choosing the existing template (20130801_hv_gold).


Now we have a complete set of bound Service Profiles that provide us with a solid base for consistent configuration.

Now Comes the Change

We have built out the base model, but now comes the need for the configuration change. As mentioned before, in this example we are changing the Firmware versions. Let’s create a new Firmware Policy by choosing the Firmware Management Package from the Servers tab in the UCS Manager GUI.

We now have a new Firmware Policy that we can use for our new image testing.  In this example, it’s been a month since we first created our versioning system, so we’re going to label our new Firmware policy as 20130901_hv_fw.

The first thing we should do is test this out, and the best way to do that is to grab one of our Service Profiles and make the changes to that one.  To begin this process, we take one host out of production, then we unbind that Service Profile from the Template as shown here.


Now we can directly modify that Service Profile for our process. Using the new Firmware Policy we just created (20130901_hv_fw, which references the new Firmware version), we modify the Service Profile to reference that policy.

Since this is a modification of an existing Service Profile, we have to commit the change by clicking “Save Changes” at the bottom right.


When we make this change, be aware that the Service Profile will need to reboot the server to update the Firmware, which UCS considers a “Maintenance Activity”. We have our Service Profile (and thus our Service Profile Template) using a “user-acknowledged maintenance policy”. This means that when a maintenance activity is required, it will queue, and UCS Manager will wait for a user to acknowledge the activity before rebooting the Service Profile. We will get notified of this with a confirmation message.

If we click Yes, we will also get some other messages indicating that there are pending maintenance activities. On a Windows machine you may see a notification in the system tray as well.

On any other OS you won’t see a pop-up, but you will notice the Pending Activities indicator start flashing red-to-yellow at the top of the UCS Manager window (this happens on Windows as well, but on Windows you get multiple notifications).


If we click that, we will then see the Pending Activities list.

By clicking the “Reboot Now” check box, we will reboot the Service Profile and the Firmware update will take place. You can watch this happen by clicking on the FSM (Finite State Machine) tab and watching the steps as they take place.

Templates Are Your Friend, Again

We now can take the newly modified, rebooted, and tested Service Profile and create a new known-good template.  Once again, right click and select “Create Service Profile Template”.  In our example, we’re creating an updating template with the name 20130901_hv_gold.


And you can now see we have very quickly created a second Service Profile Template.

My Kingdom for a Trouble-Free Maintenance Window

We now have our existing template and our test machine that we used to verify proper operation and then moved back to production. We also have our newly minted template, and now we need to apply it to the production workloads. An important question to consider is when and how to do this. My suggestion would be to roll these during a maintenance window, and the impact of such a maintenance window will obviously depend on the workload you’re managing. Bare metal, non-clustered servers are a bit more impacted than virtualized hosts. You should be able to determine the possible impact and plan accordingly.

Let’s assume that we have procured the necessary maintenance window and it’s time to roll our new Firmware into the rest of the environment. We can now highlight all of the affected Service Profiles by shift-selecting all of the Service Profiles in our set (hv_0, hv_1, hv_2, and hv_3), right clicking the set, and choosing “Bind to a Template”.

Choose the new Template, which will then give us the message informing us of our maintenance policy, yielding the new pending activities list.

Something to note here is that hv_0 is not in this list. Since we have already gone through the process during our testing, its binding to the new Template will not require any maintenance activities. A suggestion here is to choose a host, start maintenance mode, and wait for any VMs to migrate off. Once that is done, you can come to this window, select the host’s Service Profile, check the “Reboot Now” box, then hit Apply (or you can hit OK). This will kick off the maintenance activity that is required to update the Firmware. Once that host is finished, stop maintenance mode on that host, then move to the next host – lather, rinse, repeat until you are done with the cluster.

As a side note: if you wanted to automate these maintenance activities, check out some of the awesome work done by Eric Williams, a slammin’ good coding dude at Cisco, as evidenced by some of his work here at the developer.cisco.com community forums.

What About the “Oh Snap” Factor?

Yeah, well, I think we know exactly what I really meant there, but it’s a good and important question, no matter how badly phrased my PG-rated version is.  This is where we can utilize our previous-versioned Templates for Configuration Rollback.  Let’s say we went through all of this and there just so happened to be a service impacting problem that we didn’t catch in our testing (blame QA, they’re used to it).  While this is certainly not something we want to have to deal with, it’s something that we can easily do.

Let’s follow the same procedure we did to bring all of the Service Profiles up to the new version, just in reverse. If we want to roll back to what in our example is our last known good, we can shift-select all of the Service Profiles, right click, and select “Bind to Template” again, choosing our old stand-by, 20130801_hv_gold.


Of course, we will be prompted with our notice of what maintenance activity this will entail.


Then we will come back to our Pending Activities list, this time with all of our affected hosts in the list. Depending on the maintenance window you worked out, you can follow your maintenance schedule as before: select one host at a time, use host maintenance modes to move workloads around, and selectively reboot one host at a time.

Once again, you can also utilize an automation script, or just say the hell with it and reboot them all at once.  If you choose this last one, please clear your browser history, pretend you never heard of me and freshen up the resume.  Don’t say I didn’t warn you.

If you want to monitor the status of the changes (in this step or any other where the server is in the throes of a maintenance activity), you can click on the “FSM” tab and watch the progress as well as the step-by-step details as the process goes on. If you have reached step 38 (as of version 2.1.1), you are beginning the process of the Firmware Updates, starting at step 38 with the BIOS Image Update.

On the Usefulness of Firmware Policies

So, as a footnote, I am a HUGE fan of using Firmware Policies and consider their use self-evident; however, I commonly have to field the question, “why bother?” One simple experience that I like to fall back on from many years in the field: when have you EVER gotten a replacement server during a hardware failure and replacement that had the EXACT same firmware as the server you are replacing?

Yeah.  Exactly.

Thanks for reading.  See you next time.

— Loy

Follow me on twitter @loyevans

Nexus 9508 NX-OS Upgrade Procedure


So, if you haven’t heard, I’m in a different role at Cisco. I’m now working on our Nexus 9000 (N9K) and ACI initiatives. It’s an exciting time to be here and feels similar to the launch we did with UCS 5 years ago (which was more fun than one human should have been allowed).

Anyway, I’ve been working with a new Cisco Nexus 9508 in the lab this week and needed to upgrade the BIOS and NX-OS software on it. It’s part of the new Nexus 9000 portfolio of switches we started shipping in November 2013. The modular line is the Nexus 9500 and the fixed line is the 9300. Anyway, I thought I’d record the process because someone else might want to do the same. One thing to keep in mind is that if you are running NX-OS release 6.1(2)I1(1) and are upgrading to a release that includes BIOS updates, the update process will be noticeably slow during the BIOS section. This problem is fixed for any future updates once you are running 6.1(2)I2(1) or higher.

This process is very similar to other Nexus platforms, but not identical. The N9K platform has some key advantages when it comes to upgrades. For starters, there is a single image file used for everything. No more kickstart + system image files that have to match and be maintained. Further, if you download the single image file for the 9508, you now have the image file for any Nexus 9000 switch we make (including the 9300 ToR series). Pretty cool stuff.

This video speaks for itself – it’s a pretty simple process to follow, but I thought some written instructions would help. In addition, I’m going to point you to the official Cisco release notes for the version I used. Be sure to always check the release notes for any version you download prior to install. We sometimes introduce better and/or different ways to perform the upgrade and you don’t want to miss something. The release notes can be found here: http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/6-x/release/notes/61_nxos_rn.html

The abbreviated procedure steps for this specific upgrade (I1 to I2) are:

Step 1 Copy the n9000-dk9.6.1.2.I2.1.bin image to bootflash.
Also copy the EPLD .gimg file to bootflash.
Step 2 Change the boot variables to the NX-OS image by entering the following commands:
switch (config)# boot nxos bootflash:n9000-dk9.6.1.2.I2.1.bin
Step 3 Enter the copy running-config startup-config command to set the startup boot variables to the NX-OS image.
* THIS STEP IS VERY IMPORTANT – do not forget it or your BIOS upgrade will take much longer (like mine did). There is a quick sanity check shown after these steps.
Step 4 Copy the running-configuration file to a backup-configuration file to ensure that you load the running configuration after you make the upgrade.
Step 5 Enter the write erase command. The boot variables remain set (this is why you made a backup).
Step 6 Enter the reload command.
On a Cisco Nexus switch with dual supervisors, an “Autocopy in progress” message might appear when you enter the reload command. Enter No and wait for the auto copy operation to finish.
Step 7 Wait 2 minutes after the reload for all modules to come online before proceeding to the next step.
Step 8 Enter the install all nxos bootflash:n9000-dk9.6.1.2.I2.1.bin command to upgrade the BIOS. If you have successfully booted off the I2 NX-OS code, the chassis will not reboot at the completion of this step. If not, this step is disruptive and will reboot the chassis. Do not attempt to reboot or power off the chassis during this operation. If it reboots, wait 2 minutes after the reload for all the modules to come online before proceeding to the next step.
Run the command “show ver” to verify you are running 6.1(2)I2(1).
Step 9 Enter the “install epld bootflash:n9000-epld.6.1.2.I2.1.gimg module all” command to upgrade the EPLD. The chassis will reboot automatically. This is disruptive regardless.
Step 10 Wait 2 minutes after chassis reload and then enter the “install epld bootflash:n9000-epld.6.1.2.I2.1.gimg module all golden” command to upgrade the golden EPLD. The chassis will reboot automatically.
Step 11 Restore the configuration that you saved in Step 4.
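Here is the quick sanity check mentioned in Step 3 – before the reload in Step 6, confirm that the boot variable points at the new image and has been saved (just the commands; the output is not reproduced here):

switch# show boot
switch# copy running-config startup-config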

Note: I make one mistake in this video that’s not critical, but it causes the BIOS upgrade to take longer than needed. I failed to copy the running config to startup after I set the boot variables to I2. This makes the reload I do worthless, and it came back up on I1.


Thanks for stopping by

-Jeff

Fabric Interconnect booting to bash?


So, I thought I’d share an experience I had yesterday where my Cisco UCS Fabric Interconnect (FI) wasn’t feeling well, and in my attempt to resurrect it, I seemed to break it even more. I’m sure that never happens to you… The FI was now booting to a bash prompt instead of the normal UCS console interface. It would get to the point where it would say “System is coming up….Please wait”, repeat this about 12 times, and then display the bash prompt. I won’t bore you with what I actually did in my attempt to get beyond this, but let’s just say I spent about 2 hours debugging it when the fix should have only taken about 5 minutes (hindsight is 20/20). It goes without saying that this situation should not happen under normal circumstances, but I’ve heard rumblings of people seeing this here and there after upgrading to 2.2.x. So if Google brought you here looking for a solution, you’re in luck.

All you need to do is:

bash# shutdown -r now

As the FI boots, press Ctrl + r to interrupt the boot process

loader>dir

Get the UCS kickstart file name – it is preferable to boot the actual file name that the FI has in the /installables/switch folder, which is the name of the kickstart image that you last installed. This can be found by looking at the working FI and running “scope firmware” and then “show image” (that’s from memory, but I think that’s it). However, you can use the kickstart in the root if you can’t figure it out.
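On the working FI, that check looks roughly like this (the output is a long list – look for the latest ucs-6100-k9-kickstart entry):

UCS-B# scope firmware
UCS-B /firmware # show image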

loader> boot ucs-6100-k9-kickstart.5.0.3.N2.<xyzabc>.bin

When the system comes up and sits at the boot prompt, run “erase configuration”:

switch(boot)# erase configuration

That should do it. The FI needs to reboot and come back up as if it were brand new and ask to create/join the cluster.

Hope this saves someone some time.

-Jeff

The Cisco Nexus 9000 – 10 Cool Features


So, the smart guys and girls in San Jose have been working day and night (literally) to bring you Cisco’s latest Nexus switch family – the Nexus 9000. It includes both the modular 9500 and the Top-of-Rack 9300. I wrote down 10 of the coolest features that came to mind that I feel are worth sharing.

  1. There is no midplane. That’s right, if you remove the line cards, fans, and fabric modules, you would leave a hole large enough to pass a small child through – perhaps a hobbit even. You can literally see right through the chassis. Why would we do this? Well, the midplane introduces two challenges to designing a chassis: a) the midplane will need to be replaced for a technology shift from 40G to 100G, and b) even though it’s extremely rare, a bent pin on a midplane is a pretty large service event. In a previous life, I worked for a manufacturer where I learned about a contagious datacenter pandemic called “bent pin disease”. It happens when a pin is bent on a midplane and someone inserts a device into that slot. That device is forced into place with the bent pin, causing damage to the connector on the device itself. The operator realizes it’s not fitting properly, removes it, and tries it in another chassis (troubleshooting 101). Because the device has a messed up connector, it damages the pins on the second chassis, and this is how the disease spreads from chassis to chassis and device to device. Very painful. No midplane – no problem.
  2. Native 40G technology. Need I say more?
  3. Common Image – Both the modular (9500) and Top-of-Rack (9300) boot from the same image file. How is this a benefit to you as the customer? Well, if we find and fix a bug on the 9500, there is no lag in the bug being fixed on the 9300. Pretty cool.
  4. Single System Image – When I said image “file” in #3 above, I meant exactly that – file, as in a singular file. No more kickstart and system images that are a pain to find and match when a switch is down and time is tight.
  5. Improved Patching – We can now patch a bug to an executable or a library inside the image without a whole new system image. This should speed the time to release updates.
  6. XMPP support – Add your switches to your favorite IM client (jabber, messenger, AIM, etc) and simply send an IM to a switch to collect info like “show run int eth 0/1” and get the results instantly.
  7. SMTP Destinations – Tired of logging your SSH session, collecting a switch config, and then attaching the log file to an email? Now you can simply run a command like:
    show run | email <from> <smtp-host> <subject>
    and the switch will email the results of the command to the address specified. You can also pre-define the email configuration using the command ‘email’ while in configuration (conf t) mode.
  8. Linux BASH Shell access – you heard that right. It’s no secret that the OS running under Nexus is a hardened Linux. And now we’ve given you access to the actual bash shell so that you can do things like cron jobs, check available system resources (meminfo, ps), etc. From configure mode, type:
    feature bash
    then from exec mode type:
    run bash
    (there is a short sketch of this after the list)
  9. In a single 9508 chassis, you can have 288 40G ports or 1152 10G ports. This is achieved via the ability to take a single 40G port and split it into 4 distinct 10G ports using a special break-out cable.
  10. 40G “BiDi” (pronounced bye-dye) – a Cisco-exclusive offering that gives you 40G speeds over existing MMF installations in your DC. No need to rip out all your existing fiber to run 40G! These new optics allow you to upgrade your Ethernet network to the latest 40G technology and not bear the burden of new fiber runs. And if that savings isn’t enough, I bet you’ll be pleasantly surprised by the price of these optics!
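Here is the short sketch promised in item 8 – enabling and entering the bash shell (the prompts are approximate, and meminfo/ps are just examples of things to poke at once you’re in):

switch# configure terminal
switch(config)# feature bash
switch(config)# end
switch# run bash
bash$ cat /proc/meminfo
bash$ ps aux
bash$ exit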

Thanks for stopping by. I hope to find time to write more in the coming months and keep you updated of all the cool technology available in this new line of datacenter switches.

-Jeff


Erasing a single UCS FI


So, I recently noticed that nowhere on the web (that I could find) is it documented what happens when you run “erase configuration” on a single Fabric Interconnect that is part of a cluster. Does it erase the configuration on just that one FI, or does it erase the whole UCS “system” as the command warning says? I know the suspense is killing you… Well, it’s just the single FI, and it leaves the other FI and the full configuration intact. This is useful if you need to rebuild the config of a single FI that is part of a cluster.
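For reference, the sequence is just the usual one, run while logged into the FI you want to wipe (FI-B here is a placeholder) rather than the cluster VIP:

FI-B# connect local-mgmt
FI-B(local-mgmt)# erase configuration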

Cool New UCS Feature–Firmware Sync


So, I had to fix a broken UCS fabric interconnect the other day and I wrote about it here. In that experience, I came across a new feature that is pretty cool. I wish I could say which version of UCS introduced this, but I’m not that “close” to each UCS release these days. But I know it works if the FI is running at least 2.1.3 because that is what I was on.

It used to be that when you attempted to cluster two FI’s together, the second FI wanting to join had to be at the exact same firmware level. And if it wasn’t already, you had to bypass the cluster, boot standalone, upgrade the firmware, erase the config, and then join the cluster. Huge pain. That process is no longer required – now when you attempt to join with mismatched firmware, the FI will prompt you to sync its firmware to the one that is running the cluster. Very easy and very cool.

-Jeff

UPDATE: The firmware sync process doesn’t work between different model FI’s. So if you are attempting to upgrade from a 6100 to a 6200 (or a future FI), you won’t be able to sync the firmware. The upgrade process is outlined here: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/upgrading/from2-0/to2-0MR/b_UpgradingCiscoUCSFrom2-0To2-0MR/b_UpgradingCiscoUCSFrom1-4To2-0_chapter_0101.html#concept_D064631BE63A4073BAB2F27D2E8693D0

When Disaster Strikes…

$
0
0

So, everyone would agree that a helping hand is nice to have now and then. Like the time I thought it would be a good idea to skateboard while holding onto my brother’s car as he drove down the street. It was his helping hand reaching down to pick me up off the road (bleeding) and sitting me in the car that I won’t forget (I still have that scar on my hip). It was brother helping brother – an understanding that when one is down, the other will help get him on his feet (hopefully before mom sees so that we could get our story coordinated as to how it happened). In the UCS world, the brothers in this scenario are the Fabric Interconnects (I’m not sure who the mother is).

There are times when a Fabric Interconnect might encounter a software failure – for whatever reason, and land at the “loader” prompt. It’s rare, but it can happen. The loader prompt is a lonely place and it’s not pleasant. The good news is that if you still have a single FI working, you can use it to resurrect the broken FI. First off, if you ever find yourself staring at the loader prompt, stop cursing and just try to unplug it and plug it back in. Don’t worry if “dir” shows no files – just try it. I’ve seen it work and the FI comes right back on the next boot. If that doesn’t work, you have some work to do…

The loader is just that – a loader. It’s “loads” an OS – like a bootstrap. You need 3 files to permanently get out of the loader – kickstart, system, and the UCSM image. Luckily all of these live on your remaining FI. The bad news is you can’t get to them without bringing it down as well. So, if you’re in production and can’t get afford to bring down the entire UCS pod, you should stop reading and call TAC. They can get you the 3 magic files you need and can get it all running without bringing anything additional offline. But if you’re in a situation where you can afford to take down the remaining FI, you can fix this problem yourself.

To make this work, you will need:

  • Non-functional FI
  • Functional FI
  • FTP Server
  • TFTP Server

Your basic recovery will include:

  1. Disconnect the L1/L2 cables between the FI’s to avoid messing up the cluster data they share
  2. Boot FI-A to loader
  3. Force FI-B to loader
  4. Boot kickstart on FI-B
  5. Assign IP address to FI-B
  6. FTP kickstart, image, and ucsm images from FI-B to an FTP server
  7. Reboot FI-B back to its normal state
  8. Get kickstart image onto TFTP server (unless FTP/TFTP are the same server)
  9. Boot kickstart image on FI-A via TFTP server
  10. FTP kickstart, system, and ucsm images down to FI-A
  11. Copy ucsm image file to the root
  12. Load system image on FI-A
  13. “Activate” the firmware on FI-A
  14. Connect L1/L2 cables back and rejoin the cluster

Reboot the “good” FI (known in this document now as FI-B), and begin pressing CTRL+R to interrupt the boot process. You will find FI-B now stops at the loader prompt too. Now type

boot /installables/switch/ <tab>

which will show you all files in this folder. You are looking for the obvious kickstart file and you want the latest one. To make the display easier to read, I would type this:

boot /installables/switch/ucs-6100-k9-kickstart <tab>

Backspace is not an option so if you make an error, use the arrow keys and the “delete” key to fix typos.

Select the latest image, hit enter, and FI-B now beings to boot the kickstart image. Give it a few minutes and you should find it stops at the “boot” prompt. This prompt is not as lonely as the loader prompt, but it’s still not a fun place to be (at least you can backspace now). You actually will have much more functionality then you did with loader, but won’t need it for this exercise. At this point you need to assign and IP address to FI-B so that you can FTP the kickstart image to an FTP server. The commands will look like this:

#Config t

#int mgmt 0

#ip address X.X.X.X <mask>

#no shut

Wait 10-15 seconds

#<ctrl+z to return the shell to the top level>

# copy bootflash:installables/switch/ucs-6100-k9-kickstart <tab>

Select the latest version and copy it to the FTP server.

DO NOT USE THE FILES IN THE ROOT OF BOOTFLASH AT ANY TIME DURING THIS PROCESS. Nothing catastrophic will happen, but the FI will not boot in the end.

The shell will prompt you for ftp server address and credentials and it should look something like this:

You need to allow about 10-15 additional seconds after you “no shut” the interface for the IP to become active and useable.

You now need to copy the system and UCSM images as well as you will need them soon enough. The other two files will look something like:

installables/switch/ucs-manager-k9.2.1.0.418.bin

installables/switch/ucs-6100-k9-system.5.0.3.N2.2.10.418.bin

again – your versions will be different

Once you are returned to the boot prompt, and all 3 files are copied, the kickstart file is on the FTP server. You should boot FI-B back into production. You now need to get the kickstart file available via TFTP using whatever process you do to make that happen. One word of caution here – TFTP blows. It runs on UDP, it’s slow, and it has no error checking. If your first attempt at booting fails, try a different TFTP server program (trust me on this – I had bruises on my head from banging it on the wall). Once the file is available via TFTP, return to FI-A which is at the loader prompt. You will now boot that kickstart image via TFTP using these commands:

Incidentally, you cannot ping this address from an outside station. Just FYI

Then it begins loading the image. It should take just a few seconds to actually start booting. FI-A will not land at the boot prompt like FI-B did earlier. You need to rebuild the filesystem on FI-A, so type:

#init system

This will take a few minutes. When it’s done, you can now use FTP to copy the 3 files down to FI-A. Use this command to retrieve each file:

#Copy ftp: bootflash:

The shell will prompt you for everything it needs to copy the files.

After all 3 are copied, one very important command needs to be run now and it won’t make sense, but you must do this. You need to run this command:

Copy bootflash:/ucs-manager-k9.2.1.0.418.bin bootflash:/nuova-sim-mgmt-nsg.0.1.0.001.bin

The nuova-sim-mgmt-nsg.0.1.0.001.bin is an exact name that is needed here.

Now that you have all 3 files local on the FI, you would be able to recover much quicker if the FI were to lose power. At this moment, if that happened, you would be returned to the loader prompt, but you would be able to boot via bootflash instead of TFTP. Anyway, you are now at the boot prompt and need to finish booting. Type load bootflash://ucs-6100-k9-system.5.0.3.N2.2.10.418.bin. This will start loading the system image and when it’s done loading, it will look for the UCSM image that you also copied and the FI should come up. It will walk you through the setup menu and since the L1/l2 cables are not connected, I would go ahead and set it up as standalone – we will join it to the cluster soon. Once you are logged into the FI, you need to activate the current firmware to set the startup variables. The easiest way to do this is in the GUI. Just go into Firmware Management and select “Activate Firmware” and select the FI. You will likely see that no version is in the startup column. Regardless, you need to activate the version that is already there. If it doesn’t let you, exit Firmware Management and navigate to the Fabric Interconnect on the left-side Tree menu and activate the firmware from there using the “force” option. This will fix up the ucsm image file that we copied to the root as well (turns it into a symbolic link).

That’s about it. You should be OK to erase the config on FI-A (#connect local-mgmt), hook up the L1/L2 cables, and rejoin the cluster on the next reboot. I really hope you never need to use this… I mainly wrote this blog for myself because in the lab we do a lot of crazy stuff and I often forget a step here and there, so I wanted it all written down to refer back to – I’ve wanted to get this one done for quite some time.

Thanks for reading…

-Jeff

Resetting UCS to Factory Defaults


So, way back in early 2009, Sean McGee and I decided to work over the weekend in San Jose to get more stick time with “Project California” as UCS was called then. We borrowed a system from someone, backed it up, and started discovering how UCS worked. We had no help locally since it was a weekend, and one thing I wanted to know was how to erase the configuration and start over. We were still months away from documentation and the online help inside the pre-1.0 UCSM was very incomplete. We eventually did figure out how to erase the configuration and start over, but we had to stumble upon it. Resetting UCSM is a well-documented process now, but I thought I’d write this post to cut through the prerequisites, the backup reminders, etc., and just give you the commands to get the job done. You’re on your own to make sure you really want to do this.

In another blog post I covered how to restore a failed Fabric Interconnect (FI). It gives you some insight into a complete and total rebuild of an FI in a worst-case scenario. While that would accomplish the “factory defaults” you desire, it’s a painful way to get there. Thankfully, the “erase” process is pretty easy. There is no way to do this in the GUI so grab your favorite ssh client and connect to either FI. Once connected, type the following:

FI-A # connect local-mgmt

FI-A (local-mgmt)# erase config

That’s it! You’ll need to confirm the command before it executes, but that will start the process. You then need to repeat the process on the other FI. There is a way you could erase both of them by connecting to the VIP and not directly to the FI, but I’m going to cover that feature in another post because it’s pretty cool all by itself.

-Jeff

UCS Command Line Shells


So, about 2 years ago I was with a customer who had opted to purchase UCS over their incumbent HP hardware for their private cloud build. As a first step, we upgraded the firmware on the UCS system. What I did not know at the time was that the mgmt0 cable plugged into the “B” Fabric Interconnect (FI) was showing link, but was not on the right vlan (or wasn’t passing traffic). When it came time in the upgrade to fail over the management instance of UCSM to the “B side”, we completely lost access to UCS Manager. This and other seemingly related events (which were actually totally unrelated in hindsight) led me to believe that UCSM had failed in some manner and started me down a multi-hour troubleshooting session that I really wished had never happened. I opened an enhancement request to allow UCSM to detect this situation in the future and move UCSM back to the originating FI if it is unable to find the default gateway. Had I known this trick that I am about to tell you concerning the UCS shells, I might have been smart enough to get out of my situation much faster. The sad thing is I actually did know this – it was just knowledge from so early on in my UCS learning curve that I didn’t fully absorb the importance of it. So, now is your chance to start absorbing…

If you have spent any time around UCS (and if you are reading this, you probably have), you know that there is a command-line interface in addition to the provided GUI. The actual “UCS” command line is the starting point “shell” that you are automatically in when you ssh to the UCSM Virtual IP (VIP). We’ll refer to this as the root shell for the purposes of this document. Although root is the main shell, there are many sub-shells available to you in UCS that accomplish various tasks. This post will focus on accessing two specific sub-shells, local-mgmt and NXOS. This article assumes you have knowledge of what each of these shells is for and will not discuss the details of these sub-shells, but will give you an understanding of how to navigate the root shell to gain access to these other sub-shells.

It helps if you think of the shells in a hierarchical manner (such as the graphic above). As I mentioned, there are additional sub-shells beyond what is listed above, but NXOS and local-mgmt are by far the most-used, and they are unique in how you can access them. Because the root shell sits above the sub-shells of both fabrics, it allows you to access either sub-shell on either fabric (assuming you are connected to the UCSM VIP and not an individual FI). For instance:
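
A minimal walk-through might look something like this (the prompts here follow the FI-A / FI-B naming used elsewhere in this blog; yours will show your own system name):

FI-B# connect local-mgmt A
FI-A(local-mgmt)# exit
FI-B# connect nxos A
FI-A(nxos)# exit
FI-B#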

Notice that I started out on Fabric B because that was the controlling instance (FI) of UCSM (you can flip the controlling instance back and forth without data plane disruption – a post for another day). While on Fabric B, I typed connect local-mgmt A. The UCSM root shell then connected me to the local-mgmt sub-shell on fabric A. Had I typed just connect local-mgmt (omitting the “A”), it would default to the fabric that the VIP is currently on (in this case, B). From the root shell, you can do the same type of connection to the NXOS sub-shell on either fabric as well. You cannot jump from a sub-shell to any other sub-shell. You must “exit” back to the root shell to enter any sub-shell.

Back to my bad day story…had I remembered this trick, how would I have avoided the issue? Well, I could always access the A Fabric Interconnect. From there, I could have run connect local-mgmt B, accessed UCSM (which was running just fine on Fabric Interconnect B), and flipped UCSM back to Fabric Interconnect A using local-mgmt commands. Success in doing that would have instantly led me to the mgmt0 connection on the B fabric. Things like this are much easier to spot the second time around though – and I saw it again at a customer in production who had a faulty connection to FI-B. In that instance, fixing it was really easy (and they thought I was really smart – no, I didn’t tell them the truth).

That’s pretty much all there is to it. If you want to play around with the various other shells, you can type connect ? at the root shell and it will return all the possible devices you can connect to.

 

P.S. Ironically, the same day I wrote this article, I got a call from a co-worker who “could not connect back to UCSM after the primary FI rebooted during a firmware upgrade”. We used this trick (which he thought was way cool) and then discovered later that he had a flaky Ethernet cable in mgmt0 in the (formerly) subordinate FI. If you’re curious about why the enhancement I referenced above didn’t help here, it’s because the enhancement (mgmt0 interface monitoring) is enabled by default on all NEW installations but left at the previous setting on any UPGRADES (because change is a bad thing). I believe that change went into the 2.0 release.

 

Thanks for your time.

 

-Jeff

 

ENM Source Pinning Failed – A Lesson in Disjoint Layer 2


So, today’s article is on VLAN separation, the problems it solves, and the problems it sometimes creates. Not all networks are cut from the same cloth. Some are simple and some are complex. Some are physical and some are virtual. Some are clean while others are quite messy. The one thing that they all have in common is that Cisco UCS works with all of them.

A Look Back

In UCS, we have a topology that we call “disjoint Layer 2” (DJL2) which simply means that there are networks upstream from UCS that are separated from one another and cannot all be accessed by the same UCS uplink port (or port channel). For instance, you might have upstream VLANs 10, 20, and 30 on UCS uplink port 1 and VLANs 40, 50, and 60 on UCS uplink port 2. Prior to UCS 2.0, this configuration was not supported (in End Host Mode (EHM)). The main reason is that prior to 2.0, when VLANs were created, they were instantly available on ALL defined uplink ports and you could not assign certain VLANs to certain uplink ports. In addition to this, UCS uses the concept of a “designated receiver” (DR) port that is the single port (or port channel) chosen by UCSM to receive all multicast and broadcast traffic for all VLANs defined on the Fabric Interconnect (FI). To make this clear, UCS receives all multicast/broadcast traffic on this port only and drops broadcast/multicast traffic received on all other ports. Unless you have DJL2, this method works really well. If you do have DJL2, this would lead to a problem if you defined the above VLAN configuration and plugged it into pre-2.0 UCS (in EHM). In this situation, UCS would choose a designated receiver port for ALL VLANs (10-60) and assign it to one of the available uplinks. Let’s say the system chose port 1 (VLANs 10, 20, and 30) for the DR. In that situation, those networks (10, 20, 30) would work correctly, but VLANs 40, 50, and 60 (plugged into port 2) would not receive any broadcast or multicast traffic at all. The FI will learn the MAC addresses of the destinations on port 2 for 40, 50 and 60, but necessary protocols like ARP, PXE, and DHCP (just to name a few) would be broken for these networks. In case you’re wondering, pin groups do not solve this problem so don’t waste your time. Instead, you need UCS 2.0+ and DJL2, which allows specific VLANs to be pinned to specific uplink ports. In addition, you now have a DR port for each defined VLAN as opposed to globally for each FI. If you want to know more about the DR port, how it works, and how you can see which ports are the current DR on your own domain, please see the Cisco whitepaper entitled “Deploy Layer 2 Disjoint Networks Upstream in End Host Mode” located here: http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns944/white_paper_c11-692008.html

The Rules

You’ve probably figured out that if this were super easy, I wouldn’t be writing about it. Well, yes and no. It’s easy to turn on the DJL2 feature, but there are some lesser known rules around making it work. There is no “enable DJL2” button and you won’t find it by that name in UCSM. You simply enable it when you assign specific VLANs to specific uplink ports. It’s then automatically on. But many people make a mistake here. Staying with the above example, you want port 1 to carry VLANs 10-30 and port 2 to carry VLANs 40-60. When you first enter VLAN manager, you will see VLANs 10-60 defined and carried on ports 1 and 2. You might think to just take VLANs 40-60 and assign them to port 2. Well, that does remove 40-60 off of port 1, but it would also leave 10-30 on port 2 (along with 40-60). So you must isolate VLANs to their respective uplink ports. Furthermore, if you define a new VLAN, you need to go into VLAN manager and pin it to the port(s) you intend and remove it from the ports it should not be on. The main thing to remember here is that the original UCS rules on VLAN creation have not changed. That is, a created VLAN is always available on all uplink ports. That still happens even when you have DJL2 set up because UCS Manager has no idea where to put that new VLAN unless you tell it – so it follows the original rule. I recommend looking at your VLAN config in NXOS (show vlan) before you run it in production. This will verify that the changes you wanted to make are truly the changes you made in the GUI.
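A quick sanity check might look something like this when you are SSH’d directly into an FI (if you are connected to the cluster VIP, use connect nxos a or connect nxos b instead) – the port membership listed for each VLAN should match the uplink pinning you set in the GUI:

FI-A# connect nxos
FI-A(nxos)# show vlan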

ENM Source Pinning Failed

So now we have DJL2 set up properly on our uplinks. Let’s look at the server side as it is often an area of confusion. It’s probably also the way most of you found this blog entry because you googled for the term “ENM Source Pinning Failed”. Let me explain why. When you create vNICs on a service profile using the config we had above (10-30 on port 1 and 40-60 on port 2), you are not able to trunk/tag VLANs from BOTH port 1 and port 2 to the same vNIC. For example, you can have a single vNIC with VLANs 10, 20, and 30 and another vNIC with VLANs 40, 50, and 60, and both vNICs can be on the same server. But you CANNOT have a single vNIC with VLANs 10 and 40. If you do, the vNIC will go into an error state and will lose link until one of the VLANs is removed. The picture below might help – keep in mind that this diagram is very simplistic and that you can also get an ENM source pin failure with just a single FI:

The above illustration shows a configuration where the user wants to have VLANs 10-50 reach a single server, but this will not work in a DJL2 configuration and will result in an ENM Source Pin Failure. Instead, the illustration below would achieve the desired result of VLANs 10-50 reaching the same server, but does not violate the DJL2 rules and would work fine.

Hopefully this helped explain DJL2 a little better and maybe alleviate the ENM Source Pinning error you might be getting.

Thanks for stopping by.

-Jeff

Update: I am running UCSM version 2.2.1d in the lab at present and came across a scenario I need to share. I had vlan 34 on FI-A and vlan 35 on FI-B. I did not need failover for this so each vlan was isolated to a single FI. I set up disjoint L2 correctly and “show vlan” in NXOS mode showed me that it was indeed set up the way I intended. However, any profiles that used vlan-34 on FI-A would throw the dreaded ENM source pin failed error in the profile. I spent several hours verifying and double-checking everything, but no joy. I then ran this command:
FI-A (nxos)# show platform software enm internal info vlandb id 34

I got nothing back. Nada, zilch, nothing.
Running this on FI-B, I got what I expected:
FI-B (nxos)# show platform software enm internal info vlandb id 35
vlan_id 35
————-
Designated receiver: Eth1/1
Membership:
Eth1/1

Assuming something was really confused here, I rebooted FI-A and all was well. If you encounter this, I don’t suggest you reboot the FI (unless you’re like me and it’s in a lab); I would call TAC instead and let them find a non-disruptive method of correcting the issue. I just want to make a note of it here in case you feel like you got it all right and it still has an issue.

2nd update:

You can run into this problem from the opposite direction if you implement the uplinks first and then create the profiles later. If that’s the case, the profiles will throw an error before being applied saying “not enough resources overall” and “Failed to find any operational uplink port that carries all vlans of the vNICs”.

Change Management with Change Tracking, Version Control… and Rollbacks



So… what we’re going to discuss here is a method by which you can implement some mechanism for change management, version control and rollback ability in UCS through service profile templates, but first, I’d like to give a little background.

My name is Loy Evans, and I’m a Cisco Data Center Consulting Systems Engineer, like Jeff. In my past, I’ve held a number of varied jobs in the IT industry, from Programmer to Router Jockey to Data Center Master Architect. For the past few years, I’ve been consulting on UCS for customers in the Southeastern US. I pride myself on understanding not just what customers ask for, but also the questions behind the question being asked. This typically leads me to one of two things: either a business need or a technical issue. OK, mostly some of both, but there’s always a tendency in one direction or the other. In my opinion, it’s very important to understand the root of the question, as there will likely be subtle differences in how you approach the answer and maybe even more subtle differences in how you present the solution. When I talked to Jeff about some stuff I was doing, he thought it would be “in the wheelhouse” of what he considered core content for his blog. “So”… here we go.

Back to the lesson at hand… In this case, the customer brought up an issue that had recently happened where a couple of changes had taken place (a change in BIOS configuration, a Firmware update, and some operational network setting changes), and because of the way they had implemented them, they had no idea when the changes were done or how to track the impact to the service profiles. These changes were made by modifying policies that were already being referenced by the service profiles, thus making change management difficult, if not impossible, and the ability to monitor the magnitude and rate of change non-existent. On top of that, they had no process for implementing the changes in an orderly fashion. In short, they had a great tool in UCS Manager, but were not using it for efficient operational control.

I decided to step back and look at the problem from a little higher viewpoint. My take on it was first: WHAT problem are you trying to solve, then HOW are you solving it? The answer to the former was simple: we have to adjust the environment to keep up with addressing a business need (adding/removing VLANs to a cluster) and fixing a technical issue (a Firmware upgrade to support a new feature or a BIOS configuration change to address a hypervisor bug). The answer to the latter was not so simple. In this case, they had not really worked out a system, and the implementation of the fixes followed bad form: modify the configuration of a policy already in place. I’d say that’s probably a worst practice. I guess there is a bit of a gotcha… while UCS Manager is very flexible and you can just edit a policy at will, that doesn’t mean you should. The good news is you have options; the bad news is…you have options.


So, my suggestion was to begin a practice of version control based on Policies and Templates.  The following is a description of a set of concepts and practices that we put into place, and I now use as a recommended practice to all customers as they look to operationalize UCS in their organizations.  For this discussion, I’m going to use Firmware Management for UCS Blades as the change we are implementing.

Keep this in mind: this is not the only change that you can manage through this process; it can extend to almost any change you might want to put in place on UCS.

Instead of Modifying, Try Creating and Adding a Date/Time Stamp

In this example we are going to create a new Firmware Management Policy (the previous version was 2.1.1f, the new version is 2.1.2a). To keep with the date stamp theme, we create a firmware management policy with a name of 20130901_hv_fw, which references the blade firmware package of version 2.1.2a, as shown below.

For the example documented here, we have previously created one (named 20130801_hv_fw), and we created a new one as mentioned above.  I will reference these for the rest of this post.

Most would typically just go change the service profile or updating template and move on. However, this only exerts control at one level, not at the root level for the workload, where we would find the most useful benefits of configuration management – we would gain low-level control but not maintain high-level control. So let’s not stop there with version control.

Templates Can Be Your Friend

Now we will take a service profile that is currently impacted by the business or technical issue, right click and create an updating service profile template.

Side note: In this and in all select-click actions, you can right click in the navigation pane on the left side, or you can use one of the context links in the content pane on the right-hand side of UCS Manager.

In our example I’ll use a service profile named hv_0 as our primer, which is a service profile created for a Hyper-V workload. This primer is the workload that we used to test the configuration, and once it is tested and verified, we can use it as the model for the rest of the Service Profiles. We can make experimental changes, including the firmware policy, to this Service Profile in our test environment, test it out, then use it as a reference. You can see here that we have used the Firmware Policy labeled 20130801_hv_fw.

Once we have done this, it’s very easy to create replicas.  First we create a Service Profile Template by right clicking and selecting “Create Service Profile Template”.


We will configure this as an Updating Template – functionality that we will use later.



This action takes only a few seconds, and once we have that Template, we can right click it to create the directly associated Service Profiles.  In this example we will create 3 more Hyper-V host workloads, all with identical configurations, BIOS configurations, Firmware Versions, etc. as shown below, using the same naming convention we employed on the first (hv_0).


Now that we have created these new Service Profiles, you will notice something different from the original, as shown below.  These service profiles are not directly modifiable, but rather are bound to the Template and must be either unbound or configured indirectly through the template.


If we look at hv_0, however, we will see that this is not the case, and that Service Profile is directly modifiable, as it’s not bound to a template. To maintain consistency, we can bind it to the Template we created by right clicking the Service Profile hv_0, clicking “Bind to a Template”, and then choosing the existing template (20130801_hv_gold).



Now we have a complete set of bound Service Profiles that provide us with a solid base for consistent configuration.

Now Comes the Change

We have built out the base model, but now comes the need for the configuration change. As mentioned before, in this example we are changing the Firmware versions. Let’s create a new Firmware Policy by choosing the Firmware Management Package from the Servers tab in the UCS Manager GUI.

We now have a new Firmware Policy that we can use for our new image testing.  In this example, it’s been a month since we first created our versioning system, so we’re going to label our new Firmware policy as 20130901_hv_fw.

The first thing we should do is test this out, and the best way to do that is to grab one of our Service Profiles and make the changes to that one.  To begin this process, we take one host out of production, then we unbind that Service Profile from the Template as shown here.


Now we can directly modify that Service Profile for our process, pointing it at the new Firmware Policy (20130901_hv_fw) that references the new Firmware version.

Then we can modify the Service Profile to reference that Firmware Policy.

Since this is a modification of an existing Service Profile, we have to commit the changes by clicking “Save Changes” at the bottom right.


When we make this change, be aware that the Service Profile will need to reboot the server to update the Firmware, which UCS considers a “Maintenance Activity”. We have our Service Profile (and thus our Service Profile Template) using a “user-acknowledged maintenance policy”. This means that when a maintenance activity is required, it will queue, and UCS Manager will wait for a user to acknowledge the activity before rebooting the Service Profile. We will get notified of this with something similar to this message:

If we click Yes, we will also get some other messages indicating that there are pending maintenance activities.  On a Windows machine you may see something like this in the system tray:


On any other OS you won’t see a pop-up, but you will notice the Pending Activities indicator start flashing red-to-yellow at the top of the UCS Manager window (this happens on Windows as well, but on Windows you get multiple notifications).


If we click that, we will then see the following Pending Activities list:

By clicking the “Reboot Now” check box as indicated above, we will reboot the Service Profile and the Firmware update will take place. You can watch this happen by clicking on the FSM (Finite State Machine) tab and watching the steps as they take place.
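
As an aside, if you ever want to build a user-acknowledged maintenance policy like the one referenced above from the UCSM CLI instead of the GUI, a rough sketch would look something like this (the policy name is just an example, and the exact scope syntax may vary slightly by UCSM release):

UCS-A# scope org /
UCS-A /org # create maint-policy user-ack-maint
UCS-A /org/maint-policy* # set reboot-policy user-ack
UCS-A /org/maint-policy* # commit-buffer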

Templates Are Your Friend, Again

We now can take the newly modified, rebooted, and tested Service Profile and create a new known-good template.  Once again, right click and select “Create Service Profile Template”.  In our example, we’re creating an updating template with the name 20130901_hv_gold.


And you can now see we have very quickly created a second Service Profile Template.

My Kingdom for a Trouble-Free Maintenance Window

We now have our existing template and our test machine, which we used to verify proper operation and have moved back to production. We also have our newly minted template, and now we need to apply it to the production workloads. An important question to consider is when and how to do this. My suggestion would be to roll these during a maintenance window, and the impact of such a maintenance window will obviously depend on the workload you’re managing. Bare metal, non-clustered servers are a bit more impactful than virtualized hosts. You should be able to determine the possible impact and plan accordingly.

Let’s assume that we have procured the necessary maintenance window and it’s time to roll our new Firmware into the rest of the environment. We can now highlight all of the affected Service Profiles by shift-selecting all of the Service Profiles in our set (hv_0, hv_1, hv_2, and hv_3), right clicking the set, and choosing “Bind to a Template”.

Choose the new Template.

This will then give us the message informing us of our maintenance policy.

Yielding the new pending activities list.

Something to note here is that hv_0 is not in this list. Since we have already gone through the process during our testing, its binding to the new Template will not require any maintenance activities. A suggestion here is to choose a host, start maintenance mode, and wait for any VMs to migrate off. Once that is done, you can come to this window, select the host Service Profile, check the “Reboot Now” box, then hit Apply (or you can hit OK). This will kick off the maintenance activity that is required to update the Firmware. Once that host is finished, stop maintenance mode on that host, then move to the next host – lather, rinse, repeat until you are done with the cluster.

As a side note: if you want to automate these maintenance activities, check out some of the awesome work done by Eric Williams, a slammin’ good coding dude at Cisco, as evidenced by some of his work here at the developer.cisco.com community forums.

What About the “Oh Snap” Factor?

Yeah, well, I think we know exactly what I really meant there, but it’s a good and important question, no matter how badly phrased my PG-rated version is.  This is where we can utilize our previous-versioned Templates for Configuration Rollback.  Let’s say we went through all of this and there just so happened to be a service impacting problem that we didn’t catch in our testing (blame QA, they’re used to it).  While this is certainly not something we want to have to deal with, it’s something that we can easily do.

Let’s follow the same procedure we did to bring all of the Service Profiles up to the new version, just in reverse. If we want to roll back to what in our example is our last known good, we can shift-select all of the Service Profiles, right click, and select “Bind to Template” again, choosing our old stand-by, 20130801_hv_gold.


Of course, we will be prompted with our notice of what maintenance activity this will entail.


Then we will come back to our Pending Activities list, this time with all of our affected hosts in the list. Depending on the maintenance window you worked out, you can follow your maintenance schedule as before by selecting one host at a time, using host maintenance modes to move workloads around, and selectively rebooting one host at a time.

Once again, you can also utilize an automation script, or just say the hell with it and reboot them all at once.  If you choose this last one, please clear your browser history, pretend you never heard of me and freshen up the resume.  Don’t say I didn’t warn you.

If you want to monitor the status of the changes (in this step or any other where the server is in the throes of a maintenance activity), you can click on the “FSM” tab and watch the progress as well as the step-by-step details as the process goes on. If you have reached step 38 (as of version 2.1.1), you are beginning the process of the Firmware Updates, starting at 38 with the BIOS Image Update.

On the Usefulness of Firmware Policies

So, as a footnote, I am a HUGE fan of using Firmware Policies and consider their use self-evident; however, I commonly have to field the question, “why bother?” Here’s one simple experience from many years in this industry that I like to fall back on… When have you EVER gotten a replacement server during a hardware failure and replacement that had the EXACT same firmware as the server you are replacing?

Yeah.  Exactly.

Thanks for reading.  See you next time.

— Loy

Follow me on twitter @loyevans


Nexus 9508 NX-OS Upgrade Procedure


So, if you haven’t heard, I’m in a different role at Cisco. I’m now working on our Nexus 9000 (N9K) and ACI initiatives. It’s an exciting time to be here and feels similar to the launch we did with UCS 5 years ago (which was more fun than one human should have been allowed).

Anyway, I’ve been working with a new Cisco Nexus 9508 in the lab this week and needed to upgrade the BIOS and NX-OS software on it. It’s part of the new Nexus 9000 portfolio of switches we started shipping in November 2013. The modular line is the Nexus 9500 and the fixed line is the 9300. Anyway, I thought I’d record the process because someone else might want to do the same. One thing to keep in mind is that if you are running NX-OS release 6.1(2)I1(1) and are upgrading to a release that includes BIOS updates, the update process will be noticeably slow during the BIOS section. This problem is fixed for any future updates once you are running 6.1(2)I2(1) or higher.

This process is very similar to other Nexus platforms, but not identical. The N9K platform has some key advantages when it comes to upgrades. For starters, there is a single image file used for everything. No more kickstart + system image files that have to match and be maintained. Further, if you download the single image file for the 9508, you now have the image file for any Nexus 9000 switch we make (including the 9300 ToR series). Pretty cool stuff.

This video speaks for itself – it’s a pretty simple process to follow, but I thought some written instructions would help. In addition, I’m going to point you to the official Cisco release notes for the version I used. Be sure to always check the release notes for any version you download prior to install. We sometimes introduce better and/or different ways to perform the upgrade and you don’t want to miss something. The release notes can be found here: http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/6-x/release/notes/61_nxos_rn.html

The abbreviated procedure steps for this specific upgrade (I1 to I2) are:

Step 1 Copy the n9000-dk9.6.1.2.I2.1.bin image to bootflash.
Also copy the EPLD .gimg file to bootflash.
Step 2 Change the boot variables to the NX-OS image by entering the following command:
switch (config)# boot nxos bootflash:n9000-dk9.6.1.2.I2.1.bin
Step 3 Enter the copy running-config startup-config command to set the startup boot variables to the NX-OS image.
* THIS STEP IS VERY IMPORTANT – do not forget or your BIOS upgrade will take much longer (like mine did).
Step 4 Copy the running-configuration file to a backup-configuration file to ensure that you load the running configuration after you make the upgrade.
Step 5 Enter the write erase command. The boot variables remain set (this is why you made a backup).
Step 6 Enter the reload command.
On a Cisco Nexus switch with dual supervisors, an “Autocopy in progress” message might appear when you enter the reload command. Enter No and wait for the auto copy operation to finish.
Step 7 Wait 2 minutes after the reload for all modules to come online before proceeding to the next step.
Step 8 Enter the install all nxos bootflash:n9000-dk9.6.1.2.I2.1.bin command to upgrade the BIOS. If you have successfully booted off the I2 NX-OS code, the chassis will not reboot at the completion of this step. If not, this step will be disruptive and reboot the chassis. Do not attempt to reboot or power off the chassis during this operation. If it reboots, wait 2 minutes after the reload for all the modules to come online before proceeding to the next step.
Run the command “show ver” to verify you are running 6.1.2.I2.1.bin
Step 9 Enter the “install epld bootflash:n9000-epld.6.1.2.I2.1.gimg module all” command to upgrade the EPLD. The chassis will reboot automatically. This is disruptive regardless.
Step 10 Wait 2 minutes after chassis reload and then enter the “install epld bootflash:n9000-epld.6.1.2.I2.1.gimg module all golden” command to upgrade the golden EPLD. The chassis will reboot automatically.
Step 11 Restore the configuration that you saved in Step 4.
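
Condensed, the steps above look roughly like this command sequence (the filenames match the example release used in this post; the SCP source, server address, and backup filename are placeholders for whatever you actually use):

switch# copy scp://admin@10.1.1.20/n9000-dk9.6.1.2.I2.1.bin bootflash: vrf management
switch# copy scp://admin@10.1.1.20/n9000-epld.6.1.2.I2.1.gimg bootflash: vrf management
switch# configure terminal
switch(config)# boot nxos bootflash:n9000-dk9.6.1.2.I2.1.bin
switch(config)# end
switch# copy running-config startup-config
switch# copy running-config bootflash:backup-config
switch# write erase
switch# reload
switch# install all nxos bootflash:n9000-dk9.6.1.2.I2.1.bin
switch# install epld bootflash:n9000-epld.6.1.2.I2.1.gimg module all
switch# install epld bootflash:n9000-epld.6.1.2.I2.1.gimg module all golden
switch# copy bootflash:backup-config running-config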

Note: I made one mistake in this video that’s not critical, but it causes the BIOS upgrade to take longer than needed. I failed to copy the running config to startup after I set the boot variables to I2. This makes the reload I do worthless and it came back up on I1.


Thanks for stopping by

-Jeff

Fabric Interconnect booting to bash?


So, I thought I’d share an experience I had yesterday where my Cisco UCS Fabric Interconnect (FI) wasn’t feeling well and, in my attempt to resurrect it, I seemed to break it even more. I’m sure that never happens to you… The FI was now booting to a bash prompt instead of the normal UCS console interface. It would get to the point where it would say “System is coming up….Please wait”, repeat this about 12 times, and then display the bash prompt. I won’t bore you with what I actually did in my attempt to get beyond this, but let’s just say I spent about 2 hours debugging it when the fix should have only taken about 5 minutes (hindsight is 20/20). It goes without saying that this situation should not happen under normal circumstances, but I’ve heard rumblings of people seeing this here and there after upgrading to 2.2.x. So if Google brought you here looking for a solution, you’re in luck.

All you need to do is:

bash# shutdown -r now

As the FI boots, press Ctrl + r to interrupt the boot process

loader> dir

Get the UCS kickstart file name – it would be preferable for you to boot the actual file that the FI has in the /installables/switch folder, which is the kickstart image that you last installed. This can be found by looking at the working FI and running “scope firmware” and then “show image” (that’s from memory but I think that’s it). However, you can use the kickstart in the root if you can’t figure it out.
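
On the working FI, that lookup would be something along these lines (hedged, as noted above; look for the kickstart entry that matches your running version):

UCS-B# scope firmware
UCS-B /firmware # show image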

loader> boot ucs-6100-k9-kickstart.5.0.3.N2.<xyzabc>.bin

When the system comes up and sits at the boot prompt, run “erase configuration”:

switch(boot)# erase configuration

That should do it. The FI will reboot and come back up as if it were brand new and ask to create/join the cluster.

Hope this saves someone some time.

-Jeff

The Cisco Nexus 9000 – 10 Cool Features


So, the smart guys and girls in San Jose have been working day and night (literally) to bring you Cisco’s latest Nexus switch family – the Nexus 9000. It includes both the modular 9500 and the Top-of-Rack 9300. I wrote down 10 of the coolest features that came to mind that I feel are worth sharing.

  1. There is no midplane. That’s right, if you remove the line cards, fans, and fabric modules, you would leave a hole large enough to pass a small child through – perhaps a hobbit even. You can literally see right through the chassis. Why would we do this? Well, the midplane introduces two challenges to designing a chassis because a) the midplane will need to be replaced for a technology shift from 40G to 100G and b) even though it’s extremely rare, a bent pin on a midplane is a pretty large service event. In a previous life, I worked for a manufacturer where I learned about a contagious datacenter pandemic called “bent pin disease”. It happens when a pin is bent on a midplane and someone inserts a device into that slot. That device is forced into place with the bent pin, causing damage to the connector on the device itself. The operator realizes it’s not fitting properly, removes it, and tries it in another chassis (troubleshooting 101). Because the device has a messed up connector, it damages the pins on the second chassis, and this is how the disease spreads from chassis to chassis and device to device. Very painful. No midplane – no problem.
  2. Native 40G technology. Need I say more?
  3. Common Image – Both the modular (9500) and Top-of-Rack (9300) boot from the same image file. How is this a benefit to you as the customer? Well, if we find and fix a bug on the 9500, there is no lag in the bug being fixed on the 9300. Pretty cool.
  4. Single System Image – When I said image “file” in #3 above, I meant exactly that – file, as in a singular file. No more kickstart and system images that are a pain to find and match when a switch is down and time is tight.
  5. Improved Patching – We can now patch a bug in an executable or a library inside the image without a whole new system image. This should speed the time to release updates.
  6. XMPP support – Add your switches to your favorite IM client (jabber, messenger, AIM, etc) and simply send an IM to a switch to collect info like “show run int eth 0/1” and get the results instantly.
  7. SMTP Destinations – Tired of logging your SSH session, collecting a switch config, and then attaching the log file to an email? Now you can simply run a command like:
    show run | email <from> <smtp-host> <subject>
    and the switch will email the results of the command to the address specified. You can also pre-define the email configuration using the command ‘email’ while in configuration (conf t) mode.
  8. Linux BASH Shell access – you heard that right. It’s no secret that the OS running under Nexus is a hardened Linux. And now we’ve given you access to the actual bash shell so that you can do things like cron jobs, check available system resources (meminfo, ps), etc. (see the short example after this list). From configure mode, type:
    feature bash
    then from exec mode type:
    run bash
  9. In a single 9508 chassis, you can have 288 40G ports or 1152 10G ports. This is achieved via the ability to take a single 40G port and split it into 4 distinct 10G ports using a special break-out cable.
  10. 40G “BiDi” (pronounced bye-dye) – a Cisco exclusive offering that gives you 40G speeds over existing MMF installations in your DC. No need to rip out all your existing fiber to run 40G! These new optics allow you to upgrade your Ethernet network to the latest 40G technology and not bear the burden of new fiber runs. And if that savings isn’t enough, I bet you’ll be pleasantly surprised by the price of these optics!
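
For item 8, a quick session might look something like this (a sketch using the commands listed above – note that on some NX-OS releases the feature keyword is “bash-shell” rather than “bash”, and the bash prompt and sample commands are just illustrative):

switch# configure terminal
switch(config)# feature bash
switch(config)# end
switch# run bash
bash-4.2$ cat /proc/meminfo
bash-4.2$ ps
bash-4.2$ exit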

Thanks for stopping by. I hope to find time to write more in the coming months and keep you updated on all the cool technology available in this new line of datacenter switches.

-Jeff

Erasing a single UCS FI


So, I recently noticed that nowhere on the web (that I could find) is it documented what happens when you run “erase configuration” on a single Fabric Interconnect that is part of a cluster. Does it erase the configuration on just that one FI or does it erase the whole UCS “system” as the command warning says? I know the suspense is killing you… Well, it’s just the single FI, and it leaves the other FI and the full configuration intact. This is useful if you need to rebuild the config of a single FI that is part of a cluster.

Cool New UCS Feature–Firmware Sync


So, I had to fix a broken UCS fabric interconnect the other day and I wrote about it here. In that experience, I came across a new feature that is pretty cool. I wish I could say which version of UCS introduced this, but I’m not that “close” to each UCS release these days. But I know it works if the FI is running at least 2.1.3 because that is what I was on.

It used to be that when you attempted to cluster two FI’s together, the second FI wanting to join had to be at the exact same firmware level. And if it wasn’t already, you had to bypass the cluster, boot standalone, upgrade the firmware, erase the config, and then join the cluster. Huge pain. That process is no longer required – now when you attempt to join with mismatched firmware, the FI will prompt you to sync its firmware to the FI that is running the cluster. Very easy and very cool.

-Jeff

UPDATE: The firmware sync process doesn’t work between different model FI’s. So if you are attempting to upgrade from a 6100 to a 6200 (or a future FI), you won’t be able to sync the firmware. The upgrade process is outlined here: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/upgrading/from2-0/to2-0MR/b_UpgradingCiscoUCSFrom2-0To2-0MR/b_UpgradingCiscoUCSFrom1-4To2-0_chapter_0101.html#concept_D064631BE63A4073BAB2F27D2E8693D0
