Sunday, January 23, 2011

The Joys of ISSU on Nexus 7000

How many times have you filled out a change control document to upgrade code on your network devices, detailing the redundancy, the portions of the network impacted, and the application owners notified, only to have it rejected due to "impact"? Prior to my current job at Cisco, this was a common theme. I wished I had a device that would let me roll code without impacting traffic. Fast forward a few years and my wishes have come true with In Service Software Upgrade (ISSU) within NX-OS.


A brief history lesson: storage switches have had this capability for a long time in the higher-end, director-class platforms. It makes sense to have ISSU functionality on Fibre Channel switches because Fibre Channel as a protocol relies on the network to guarantee delivery of frames. Dropping frames means bad things for storage traffic. Moving the capability for ISSU to Ethernet/IP networks makes sense in a modern data center where high-density virtualization and the "always on" mindset prevail. Networking teams have been clamoring for ISSU for a long time. Let's face it, rolling code isn't one of the more exciting things to do on a network, but it's a necessary function. The good news is that we now have it.


We'll focus on ISSU on the Nexus series of devices, though other products in Cisco's portfolio support it too. To provide a hitless upgrade capability, the device and software require an intrinsic separation of the control plane and data plane. This allows changes to be made in the control plane, such as the software version, without affecting the data plane, through which the packets and frames that traverse the device pass. NX-OS has been engineered from day one to have this separation of planes. Couple that with years of ISSU experience on the Cisco MDS, and one of my favorite features of NX-OS is born.


So enough talk, let's get into the action. To start an ISSU we use the install all command as shown below where we specify the kickstart image and system image to use.


cmhlab-dc2-sw2-otv1# install all kick bootflash:n7000-s1-kickstart.5.1.2.bin system bootflash:n7000-s1-dk9.5.1.2.bin


During the process the install happens before your eyes, which is great for the paranoid amongst us. :)


Various components are extracted from the kickstart and system files, and verified to minimize the potential for corruption. Below is a sample of the output.

Verifying image bootflash:/n7000-s1-kickstart.5.1.2.bin for boot variable "kickstart".

[####################] 100% -- SUCCESS

Verifying image bootflash:/n7000-s1-dk9.5.1.2.bin for boot variable "system".
[####################] 100% -- SUCCESS

Verifying image type.

[####################] 100% -- SUCCESS

Extracting "lc1n7k" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.
[####################] 100% -- SUCCESS

Extracting "lc1n7k" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.

[####################] 100% -- SUCCESS

Extracting "bios" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.
[####################] 100% -- SUCCESS

Extracting "system" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.
[####################] 100% -- SUCCESS

Extracting "kickstart" version from image bootflash:/n7000-s1-kickstart.5.1.2.bin.

[####################] 100% -- SUCCESS

Extracting "lc1n7k" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.
[####################] 100% -- SUCCESS

Extracting "lc1n7k" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.
[####################] 100% -- SUCCESS

Extracting "cmp" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.
[####################] 100% -- SUCCESS

Extracting "cmp-bios" version from image bootflash:/n7000-s1-dk9.5.1.2.bin.

[####################] 100% -- SUCCESS

Performing module support checks
[####################] 100% -- SUCCESS

Notifying services about system upgrade.

[####################] 100% -- SUCCESS

Once that is completed, the install routine also shows the type of upgrade per module, reflecting a rolling upgrade for line cards and a reset for the supervisors. Rolling upgrades are non-disruptive because the modules have been engineered to upgrade without dropping link on their ports or disrupting switching.


Compatibility check is done:


Module  bootable          Impact  Install-type  Reason
------  --------  --------------  ------------  ------
     2       yes  non-disruptive       rolling
     5       yes  non-disruptive         reset
     6       yes  non-disruptive         reset
     9       yes  non-disruptive       rolling
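
If you save that table to a file, or pull it in from a script, a few lines of Python can flag any module that isn't marked non-disruptive before you give the go-ahead. This is only a rough sketch: `parse_impact` and `disruptive_modules` are names I made up for illustration, and the column layout is assumed to match the output above, so check it against your own release.

```python
import re

# Sample text copied from the compatibility check shown above.
SAMPLE = """\
Module  bootable          Impact  Install-type  Reason
------  --------  --------------  ------------  ------
     2       yes  non-disruptive       rolling
     5       yes  non-disruptive         reset
     6       yes  non-disruptive         reset
     9       yes  non-disruptive       rolling
"""

# One data row: module number, bootable flag, impact, install type, optional reason.
# Assumption: columns are whitespace separated and only the reason may contain spaces.
ROW = re.compile(r"^\s*(\d+)\s+(yes|no)\s+(\S+)\s+(\S+)\s*(.*)$")


def parse_impact(text):
    """Return one dict per module row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            module, bootable, impact, install_type, reason = match.groups()
            rows.append({
                "module": int(module),
                "bootable": bootable,
                "impact": impact,
                "install_type": install_type,
                "reason": reason.strip(),
            })
    return rows


def disruptive_modules(rows):
    """Return module numbers that are not bootable or not marked non-disruptive."""
    return [
        row["module"]
        for row in rows
        if row["bootable"] != "yes" or row["impact"] != "non-disruptive"
    ]


if __name__ == "__main__":
    flagged = disruptive_modules(parse_impact(SAMPLE))
    print("Modules needing a closer look:", flagged)
```

An empty list here only means the table itself reported no impact. It is not a substitute for reading the release notes or the rest of the installer output.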


Finally, the installer presents a nice table showing the details of the upgrade and waits for the green light to continue.




Of course we want to proceed and then we see this output.


Install is in progress, please wait.

Performing runtime checks.

[####################] 100% -- SUCCESS

Syncing image bootflash:/n7000-s1-kickstart.5.1.2.bin to standby.

[####################] 100% -- SUCCESS

Syncing image bootflash:/n7000-s1-dk9.5.1.2.bin to standby.
[####################] 100% -- SUCCESS

*NOTE* The install routine automatically copies the files to the redundant supervisor for you.

Setting boot variables.
[####################] 100% -- SUCCESS

Performing configuration copy.
[####################] 100% -- SUCCESS

Module 2: Refreshing compact flash and upgrading bios/loader/bootrom.
Warning: please do not remove or power off the module at this time.
[####################] 100% -- SUCCESS

Module 5: Refreshing compact flash and upgrading bios/loader/bootrom.
Warning: please do not remove or power off the module at this time.
[####################] 100% -- SUCCESS

Module 6: Refreshing compact flash and upgrading bios/loader/bootrom.
Warning: please do not remove or power off the module at this time.
[####################] 100% -- SUCCESS

Module 9: Refreshing compact flash and upgrading bios/loader/bootrom.
Warning: please do not remove or power off the module at this time.
[####################] 100% -- SUCCESS

Module 6: Waiting for module online.
-- SUCCESS
Notifying services about the switchover.
[####################] 100% -- SUCCESS
"Switching over onto standby".
Connection closed by foreign host.

At this point, the supervisor that was the secondary (module 6 in my example) has reloaded and come up with the new code. This triggers the primary to initiate a Stateful Switchover (SSO) to the new code running in the control plane. Meanwhile, data is still traversing the switch with no impact. :)


Since our telnet session was disconnected during the SSO (telnet isn't SSO aware), we need to re-establish the session and issue a command to continue monitoring the upgrade.


rfuller@cmhlab-tools:~$ telnet cmhlab-dc2-sw2-otv1

Trying 10.2.0.4...

Connected to cmhlab-dc2-sw2-otv1.csc.dublin.cisco.com.

Escape character is '^]'.

User Access Verification
login: admin
Password:
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Copyright (c) 2002-2010, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php

cmhlab-dc2-sw2-otv1# show install all status
There is an on-going installation...
Enter Ctrl-C to go back to the prompt.
Continuing with installation, please wait

Trying to start the installer...
Module 6: Waiting for module online.
-- SUCCESS
2011 Jan 24 02:34:55 cmhlab-dc2-sw2-otv1 %IDEHSD-STANDBY-2-MOUNT: slot0: online
2011 Jan 24 02:35:06 cmhlab-dc2-sw2-otv1 %CMPPROXY-STANDBY-2-LOG_CMP_UP: Connectivity Management processor(on module 5) is now UP
2011 Jan 24 02:37:55 cmhlab-dc2-sw2-otv1 %IDEHSD-STANDBY-2-MOUNT: logflash: online

Module 2: Non-disruptive upgrading.
-- SUCCESS
Module 9: Non-disruptive upgrading.
-- SUCCESS
Install has been successful.
With that, we've upgraded NX-OS, had the system automatically copy the files to the right locations and set the boot variables, and didn't drop a frame. How's that for hot?

cmhlab-dc2-sw2-otv1# show ver | i uptime


Kernel uptime is 0 day(s), 0 hour(s), 26 minute(s), 50 second(s)


*NOTE* The kernel has been up for only a short while, but we'll see that the overall system has been up much longer.


cmhlab-dc2-sw2-otv1# show ver | i version

the GNU General Public License (GPL) version 2.0 or the GNU

BIOS: version 3.22.0
kickstart: version 5.1(2)
system: version 5.1(2)

cmhlab-dc2-sw2-otv1# show system uptime
System start time: Tue Oct 26 19:46:38 2010
System uptime: 89 days, 6 hours, 56 minutes, 26 seconds
Kernel uptime: 0 days, 0 hours, 29 minutes, 16 seconds
Active supervisor uptime: 0 days, 0 hours, 19 minutes, 56 seconds

cmhlab-dc2-sw2-otv1#
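
The same comparison can be scripted if you want to check it after every upgrade. Here is a minimal sketch, assuming the `show system uptime` text looks like the output above. `parse_uptimes` and `control_plane_only_restart` are hypothetical helper names, not anything NX-OS provides, and "system uptime longer than kernel uptime" is just my shorthand for the point made in the note above.

```python
import re

# Sample text copied from the "show system uptime" output above.
SAMPLE = """\
System start time: Tue Oct 26 19:46:38 2010
System uptime: 89 days, 6 hours, 56 minutes, 26 seconds
Kernel uptime: 0 days, 0 hours, 29 minutes, 16 seconds
Active supervisor uptime: 0 days, 0 hours, 19 minutes, 56 seconds
"""

# Matches lines such as "Kernel uptime: 0 days, 0 hours, 29 minutes, 16 seconds".
# Assumption: this handles only the "show system uptime" wording, not "show version".
UPTIME = re.compile(
    r"^(?P<name>[A-Za-z ]+ uptime):\s+"
    r"(?P<days>\d+) days?, (?P<hours>\d+) hours?, "
    r"(?P<minutes>\d+) minutes?, (?P<seconds>\d+) seconds?",
    re.MULTILINE,
)


def parse_uptimes(text):
    """Map each '<something> uptime' label to its value in seconds."""
    uptimes = {}
    for match in UPTIME.finditer(text):
        days = int(match["days"])
        hours = int(match["hours"])
        minutes = int(match["minutes"])
        seconds = int(match["seconds"])
        uptimes[match["name"]] = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return uptimes


def control_plane_only_restart(uptimes):
    """True when the chassis has been up longer than the running kernel."""
    return uptimes["System uptime"] > uptimes["Kernel uptime"]


if __name__ == "__main__":
    parsed = parse_uptimes(SAMPLE)
    for name, seconds in parsed.items():
        print(f"{name}: {seconds} seconds")
    print("Kernel restarted while the system stayed up:",
          control_plane_only_restart(parsed))
```

This only tells you the supervisors restarted while the chassis stayed up. It says nothing about whether traffic was actually dropped, so keep watching your interface counters and application monitoring for that.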

We'll cover ISSU on the Nexus 5000 and Nexus 1000v in the future. Hope this was informative.

22 comments:

  1. Hi Ron,

    Great post! Just a quick word of warning though for everyone else. Just because the Nexus supports ISSU doesn't mean that every upgrade will be non-disruptive. You need to pay close attention to the output when running the upgrade and confirm that none of the modules or line cards will be reset.

    Mike

  2. Hi Michael, thanks for reading! Great point, always read the output and don't assume the upgrade will be hitless. Also be aware that ISSU checks for network stability and won't proceed if STP TCNs are being received either. A great non-disruptive way to see if your upgrade will be impactful is to use the "show install all impact" command.

  3. Great Post.
    Always read the release notes. Mostly when moving between major versions, 4.2.x to 5.x, you need to update the EPLD on the module, which is service disruptive.
    viral

  4. Question for all:
    - Once upgrade is finished, the Standby Sup appears to take over, as it's the first one to upgrade. Will the Active Sup take control over once it's upgraded, or do we need to switch control to Active Sup manually?

    Thanks,

    Anthony

  5. Hi Anthony, the original active supervisor will not take over after the ISSU; you would have to do a system switchover to make it become active. That said, there is no requirement to do this. The switch will run fine on the secondary.

  6. And then sometimes you run into something like this, and the non-disruptive upgrade can soon become a nightmare:


    Module 9: Refreshing compact flash and upgrading bios/loader/bootrom.
    Warning: please do not remove or power off the module at this time.
    [####################] 98% -- FAIL. Return code 0x40710009 (BIOS write failed).
    CAUTION: The BIOS/loader/bootrom of above module may be in corrupted state. Please try programming it again and DO NOT reboot without programming it successfully, otherwise you have to manually take out the flash from the card and program it in a BIOS programming station.

    Install has failed. Return code 0x40930015 (Pre-upgrade of a module failed).
    Please identify the cause of the failure, and try 'install all' again.

  7. It looks like the xbar was faulty, so we replaced the xbar and the install continued flawlessly.

  8. During ISSU, while upgrading the line modules, will the line modules be rebooted after the upgrade? If yes, then this will disrupt the traffic, right?

  9. When upgrading from 4.2.2 to 5.0.5, we ran into a bug, which caused the OSPF process to hang and required a reboot to fix. I'm hoping to find no more bugs going from 5.0.5 to 5.1.5.

    No EPLD upgrades required so far, so no module reboots.

  10. Hi Anonymous,
    The modules will be rebooted but it is non-disruptive. The entire system has been engineered from day one to separate control plane from data plane so we can upgrade the control plane without impacting the user traffic.

    Replies
    1. Rob, when modules get rebooted, the devices connected to them would go down. User traffic would definitely get impacted.

  11. Ron, do you have any insight into why both the Layer 3 daughtercard and root bridging on the 5k break the data plane/control plane separation? The 5k would otherwise be the perfect next core switch for my small enterprise but the loss of hitless firmware upgrades for such a basic feature as layer 3 switching is mind boggling. I'm just trying to wrap my head around where in the architectural design things went wrong on the separation.

    thanks,
    Andrew

  12. Ron, how does the n7k handle fex cards during issu? Do the fex units reload after moving code from to code?

  13. ...web page scraped part of the question.

    Do the fex units reload after moving from previous to new code?

  14. Hi Ron,

    I have a problem with my Nexus 7010 upgrade process. I have tried twice and I still fail to upgrade my NX-OS from 6.1(1) to 6.1(2). I got this message:
    “unable to install log files”
    “errno=13”

    After that I tried to reload my system but I got stuck on "Status 90: Loading Boot Loader". After reading some documentation I assume that the problem is in my BIOS, which got corrupted, and I suspect this is because of the upgrade process because, from what I know, upgrading your NX-OS also means upgrading the BIOS. CMIIW.

    But since my upgrade process failed at the very first step (if I remember correctly, it failed on the uncompressing image file step) I assume that the BIOS didn't change. I checked my NX-OS version after those two attempts and found that it is still at 6.1(1), which convinces me that the upgrade process really failed.

    I asked Michael McNamara about this before, but unfortunately he has never experienced the same problem, so he couldn't give me any advice. Can you help me in this matter? Or maybe give me some clues to help me troubleshoot this problem.

    I plan to contact Cisco TAC support tomorrow, Monday, but I think before doing that, I have to try to figure out what the root of this problem is.

    I hope you can help me in this matter. Thanks.


    best regards,

    Yedi

  15. This comment has been removed by the author.

  16. This is an embarrassingly simple question but I can't find the answer anywhere. If I DO NOT have redundant supervisor modules and a supervisor needs to reset (say, due to repeated service failure), is that essentially a reboot that stops the Nexus 7k from forwarding?

  17. I want to upgrade NX-OS on a Nexus 7010 switch. The current system version is 5.0(2a). Please suggest the next suitable version to upgrade to and the train chart. If I want to upgrade to 5.2(9), can I upgrade directly, or do I need to upgrade to 5.1 and then to 5.2? Please help.

  18. Hi, on the 7k, do you have to be connected via SSH etc. to the standby sup, or is this a dumb question? I'm guessing that you would be consoled into the 7k?

    Cheers,

    Scott.

  19. Hi Scott,
    You can't SSH to the standby SUP as its IP stack isn't active until it becomes the active supervisor. I prefer to do ISSU from the console and, once the active sup switches over, connect to the other supervisor's console to see the ISSU complete.

  20. This may be a useful post but beware bug CSCul22703. I've always had to reload due to BIOS upgrades or power controller upgrades. ISSU can be a minefield!
