UCS Boot-from-SAN Troubleshooting with the Cisco VIC (Part 2)

So, first let me define some terms. The Cisco VIC is also called “Palo”, a codename that sort of stuck (much to the chagrin of the marketing team). Palo’s official name is M81KR. Now do you see why “Palo” sort of stuck? We have some new VIC cards as well, the VIC-1240 and VIC-1280, and Sean McGee (@mseanmcgee) talks more about the VIC-1280 here. The VIC-1240 is a built-in option on the M3 blades.

Now that we have settled that, where is Part 1 of this article? Well, my good friend Ryan Hughes (@angryjesters) got the ball rolling on this. He took it upon himself to write an excellent article explaining how to access the obscure-but-useful command called LUNLIST. So if you are looking for Part 1 to this article, I’m not the author of it. I learned some things reading Ryan’s article, which is not all that surprising since I’m rarely with Ryan when I don’t learn something. You should check out his site if you have not seen the article already, but briefly, LUNLIST is a command that shows you what the Cisco VIC HBA can actually “see” on the fabric, much like a typical HBA BIOS would, but way cooler.

Why am I writing Part 2? Well, in the comments to Ryan’s article, a responder noted that during the HBA POST process, the VIC itself will show you if zoning and LUN masking are correct and that LUNLIST may not be needed. While that comment is partly true, LUNLIST is definitely needed and is a great help in troubleshooting. There are prerequisites that must be met for the VIC to show success during POST, and when POST does not show you what you expect, you don’t always know where to start. Is the problem in the profile, the Fabric Interconnect, the upstream FC switch, or the array itself? It’s this kind of thing that makes server administrators irritated with boot from SAN to begin with. Too much of the setup is out of their control and requires a lot of joint troubleshooting with the SAN team.

Cisco UCS certainly makes this a lot easier, and I wrote an article back in late 2010 that outlines the basics of a boot from SAN scenario in UCS. Check it out if you are not familiar with the process, but I believe there is always room for improvement, and this area is no different. So with Cisco UCS 2.0 we introduced LUNLIST (and we’re not close to being done in this area, by the way). LUNLIST has a cousin command called LUNMAP that has been around a long time, but LUNLIST is the steroid-using one of the two, and when I am troubleshooting, I look solely to LUNLIST. Let’s see why…

As Ryan pointed out, LUNLIST only works prior to the OS HBA driver loading. Once the driver loads, the VIC boot BIOS is no longer in control and will not return valid data. This means that it’s more difficult to use LUNLIST to determine if your configuration is “looser” than it should be by having excess LUNs allowed to the wrong host(s). One reason I like LUNLIST compared with legacy HBA BIOS tools is that I do not have to open a KVM to the server in question and I do not have to catch the server at just the right second during POST. I can just let all the servers attempt to boot and, from one CLI, quickly and easily look at any number of HBAs in any number of servers. Pretty cool stuff. Another reason I like LUNLIST better is that in a single output, it can tell me if my problem is in the boot policy, the zoning config, or the LUN masking. Let’s take a look at some output to show you what I mean.

To get to the command, you need to gain access to the UCS CLI and run the following:

connect: connects to the VIC’s management processor

attach-fls: attaches to the fabric login service of the adapter
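As a rough sketch, the full sequence of commands can be written down as a short Python helper. It assumes the adapter is first selected with "connect adapter" (the form quoted in the rack-server update near the end of this article), followed by the two commands above and then lunlist. The function name and the sample path 1/1/1 (chassis 1, server 1, adapter 1) are made up for illustration, and the helper only prints what you would type; it does not talk to UCS.

```python
def lunlist_session(adapter_path):
    """Return, in order, the commands to type to reach LUNLIST for one adapter.

    adapter_path is "chassis/server/adapter"; the value used below is only
    an example, not taken from a real system.
    """
    return [
        f"connect adapter {adapter_path}",  # from the UCS CLI, select the adapter
        "connect",      # connects to the VIC's management processor
        "attach-fls",   # attaches to the fabric login service of the adapter
        "lunlist",      # shows what the vHBAs can see on the fabric
    ]


for command in lunlist_session("1/1/1"):
    print(command)
```

Keeping the steps in a list like this makes it easy to paste the same sequence into notes or a runbook for each blade you need to check.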

Once you run lunlist, you see output similar to the below. This one is from a server where the end-to-end configuration was all done correctly and the server could boot from SAN or attempt an installation to do so:

Now let’s break it apart and describe what you are seeing:

So, you now may be starting to see the usefulness of this command. But perhaps it will make more sense if you look at the output of a non-working configuration…

Incorrect LUN masking:

Here is the LUNLIST output from a server that is having an issue with incorrect LUN masking. The host has not been allowed access to the LUN. The same problem would likely result if the host is not set up in the array at all, or if it was created on the array but someone mistyped the host’s WWPN. Zoning is correct because the Nameserver Query Response succeeds (line 11) and returns a WWPN target that matches the WWPN target in the boot policy (line 5). The HBA successfully logged into the fabric and was able to see that a LUN of ID 0x00 is visible (line 9). But when the LUN is queried for additional information, it fails with “access failure” (line 7).

Incorrect Zoning:

In this example, the host is not zoned correctly. It is either in a zone by itself or not zoned at all. This is an easier one to troubleshoot because the host cannot see a LUN, nor can it see any available WWPN targets. Look at lines 8 and 9 and notice that there is no response returned for either of these queries. Note that the PLOGI is unsuccessful (fc_id in line 5 is 0x000000) because the host was unable to establish a session with the target.

Incorrect SAN Boot Target in the boot policy:

In this example, you can clearly see that the WWPN configured in the boot policy (line 5) does not match the available target found on the fabric (line 10). In this situation, the PLOGI (line 5) is once again unsuccessful because a session cannot be established between the host and the target.

Incorrect LUN ID in the boot policy:

In this example, someone entered the wrong LUN ID into the boot policy for the server (line 7), so it does not match the LUN ID found on the fabric (line 9).

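The four failure cases above boil down to a short decision procedure, sketched here in Python. This is only an illustration of the reasoning, not a parser: it assumes you have already read the relevant values off the LUNLIST output by hand. The names LunlistFacts, Diagnosis, and diagnose, as well as the WWPN in the usage example, are hypothetical and not part of any Cisco tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Diagnosis(Enum):
    OK = "configuration looks correct"
    ZONING = "check zoning: nameserver query returned no targets"
    BOOT_TARGET = "check boot policy: target WWPN not found on the fabric"
    BOOT_LUN_ID = "check boot policy: LUN ID not among the reported LUNs"
    LUN_MASKING = "check LUN masking: LUN is visible but access failed"


@dataclass
class LunlistFacts:
    """Values read by hand from one vHBA's LUNLIST output (hypothetical container)."""
    boot_target_wwpn: str                                    # target WWPN in the boot policy
    boot_lun_id: int                                         # LUN ID in the boot policy
    nameserver_targets: list = field(default_factory=list)   # WWPNs from the nameserver query
    reported_lun_ids: list = field(default_factory=list)     # LUN IDs the target reported
    lun_access_ok: bool = False                              # False if the LUN query failed


def diagnose(facts):
    """Apply the checks in the order the article's examples suggest."""
    targets = {wwpn.lower() for wwpn in facts.nameserver_targets}
    if not targets:
        return Diagnosis.ZONING        # host sees no targets at all
    if facts.boot_target_wwpn.lower() not in targets:
        return Diagnosis.BOOT_TARGET   # policy WWPN differs from the fabric's
    if facts.boot_lun_id not in facts.reported_lun_ids:
        return Diagnosis.BOOT_LUN_ID   # policy LUN ID differs from the array's
    if not facts.lun_access_ok:
        return Diagnosis.LUN_MASKING   # LUN is listed, but the query is refused
    return Diagnosis.OK


# Usage with made-up values: target matches, LUN 0 is visible, access fails.
facts = LunlistFacts("50:00:00:00:00:00:00:01", 0,
                     ["50:00:00:00:00:00:00:01"], [0], lun_access_ok=False)
print(diagnose(facts).value)
```

The order of the checks follows the path a boot attempt takes: the host has to find a target before the boot policy WWPN can match anything, and the LUN has to be listed before access to it can be refused.
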
Lastly, I want to show what it looks like when a properly configured host has multiple LUNs presented. I simulated additional targets in the output below, and I wish I could show you actual multiple targets too, but my lab just isn’t that big. However, if you would like to donate a larger array, I’d be happy to include it in my future examples.

One last example is what happens when you run LUNLIST while the OS is up and running with the driver for the VIC loaded. You will get this:

That’s pretty much all there is to it. Hopefully this will be useful to you when you need to troubleshoot a UCS blade that’s not booting from SAN. Remember, you need UCS 2.x (or higher) for the command to work and you can only use the command prior to the OS loading. As always, please let me know your feedback and thanks for stopping by.

-Jeff

Article Update:

If you have more than 2 vHBAs, you may need to know which are which. There is an additional command in the adapter shell you can run called “vnicpci” which will list all interfaces along with their associated server interfaces.

Article Update #2:

If you are using rack servers (UCS C-series), the syntax changes slightly to target the specific server (because there is no chassis). To target rack server 5, for example, you would type:

“connect adapter 0/5/1” (chassis 0, server 5, adapter 1). Technically speaking, you could just use “connect adapter 5/1” since rack servers do not require a chassis #, but to keep the syntax as close as possible to blades, I add the “0” in place of the chassis.
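To make the blade versus rack difference concrete, here is a small Python sketch that builds the command string for either case. The function name and its parameters are made up for illustration. It follows the convention described above of writing “0” in place of the chassis for rack servers, and it only formats text; it does not connect to anything.

```python
from typing import Optional


def connect_adapter_command(server, adapter=1, chassis: Optional[int] = None):
    """Build the 'connect adapter' command for a blade or a rack server.

    Pass chassis=None for a rack server (UCS C-series); a 0 is then used in
    place of the chassis number so the syntax matches the blade form.
    """
    chassis_id = 0 if chassis is None else chassis
    return f"connect adapter {chassis_id}/{server}/{adapter}"


print(connect_adapter_command(5))             # rack server 5, adapter 1
print(connect_adapter_command(3, chassis=2))  # blade 3 in chassis 2, adapter 1
```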
