I’d like to think of this as a chapter in the manual of “CoNE.” Code of Network Ethics. OK, so I made that up. But it should be a thing, right?
How many outages have you experienced where the original problem wasn’t nearly as impactful as the attempted fix? We have all experienced maintenance windows where we tried a fail-forward approach because we didn’t want to back out the change. And that fail-forward method cost us an extended maintenance window that bled into production time.
The impact of these errors in the care and feeding of the network goes well beyond the initial event. The real damage is in reputation. A loss of trust. Because with those scars comes the apprehension to permit the network team to conduct other maintenance, and the impulse to severely restrict the network team’s proposed changes. This leads to another domino falling over. We aren’t updating to newer software versions. We aren’t patching security vulnerabilities in a timely manner. Or worse, we’re not getting the security-related fixes applied at all.
Why rehearsals? Why the repetition? This applies to manual CLI, babysitting a GUI, or automated tasks. The steps and subsequent benefits are life-changing. We’ll regain their trust. We’ll get better at what we do. We’ll provide better service to our internal clients and our paying customers. Heck, with returned trust we may get our nights, early mornings, and even weekends back. Give some thought to the suggestions outlined below and Rehearse, Rehearse, Rehearse.
Measure the Existing
Step away from configuration mode on that production gear. Hold off on pushing play for that automated change.
First, we must measure the current state. In turn, those measurements become valuable historical data. They are the benchmarks of where we are today, telling us where we stand 5 minutes before our change and giving us a baseline to compare against 5 minutes after implementing it.
Without measuring, how do we know the state we are currently in? Without evaluation, how do we know that what we just changed resulted in the intended outcome? How do we apply some ecological philosophy and vigilantly watch for side effects?
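That before-and-after comparison can be sketched as a simple snapshot diff. This is a minimal illustration, not a vendor API: the check names and captured outputs below are hypothetical placeholders; in practice you would populate the snapshots from show commands or your monitoring system.

```python
# A sketch of pre/post-change state comparison. Each snapshot maps a check
# name to the raw text captured for that check; any check whose output
# changed across the window is flagged with a unified diff.
import difflib

def diff_states(before: dict, after: dict) -> dict:
    """Return a unified diff for every check whose captured output changed."""
    report = {}
    for name in before:
        if before[name] != after.get(name, ""):
            report[name] = "\n".join(difflib.unified_diff(
                before[name].splitlines(),
                after.get(name, "").splitlines(),
                fromfile=f"{name} (pre)",
                tofile=f"{name} (post)",
                lineterm=""))
    return report

# Snapshot taken 5 minutes before the change (illustrative data):
baseline = {"bgp-neighbors": "10.0.0.1 Established\n10.0.0.2 Established",
            "route-count": "routes: 42000"}
# Snapshot taken 5 minutes after the change:
post = {"bgp-neighbors": "10.0.0.1 Established\n10.0.0.2 Idle",
        "route-count": "routes: 42000"}

for name, delta in diff_states(baseline, post).items():
    print(f"UNEXPECTED CHANGE in {name}:\n{delta}")
```

The point isn’t the diffing code; it’s that the baseline exists at all, so that “did my change do what I intended, and nothing else?” becomes a question the tooling can answer.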
If you added up all of your outage minutes from last year, applied your organization’s “per minute/hour cost” to each outage, and then compared that total to the cost of lab gear, which one has the higher toll?
The cost of lab gear can suffer a justification hurdle. However, in today’s world, we can leverage free lab software. Most OEMs provide free downloads of rate-, performance-, or feature-limited software. There’s no reason we shouldn’t be able to mimic our current network or greenfield designs with 75% or better similarity. Sure, there are some things for which hardware is a must. Exact reload times and accurate convergence estimates are best served with those ASICs. But note that it may be easier to justify the need for hardware if you’ve demonstrated a sincere effort to do all you can with free methods.
Rehearse, Rehearse, Rehearse
The lab is where our rehearsals occur. This is our safe stage to learn our lines; to get our timing down. This is batting practice where, if we whiff, at worst our peers may mock us. But our clients and customers are unaware of failed attempts #1, #2, and #7.
A story. We need a story. Let’s set aside our train wreck stories for a minute. Instead, let’s share our success stories. Mine was at a Fortune 20 company. The network had morphed over time. The organization’s hyperactive merger and acquisition approach had left us with two production data centers running every generation and flavor of vendor gear and protocols. Making changes in this environment had become hazardous to our health. And troubleshooting even the simplest of problems became prolonged investigations; the amount of time it took us just to prove or disprove that a network problem existed would have embarrassed Sherlock Holmes. To say nothing of attempting to fix faults or apply bandaids.
So we pleaded for a maintenance window to clean things up. A campaign was launched where both the architecture and operations teams joined forces. But then debates over duration ensued. One side of the aisle insisted an 8-hour network outage was required to tear down and rebuild both data centers and the interconnection between them. The other side of the aisle claimed it could be done without dropping a packet. After weeks of banter, the hitless approach was approved. Great. Oh, wait, what? Do we actually have to pull this off? Not dropping a single packet. Can you imagine? What a tall order. And if we were to fail, we’d walk around with egg on our faces for the rest of our time employed there.
Have Whiteboard, Will Travel
The network team charged with conducting this change adopted the mantra of rehearse, refine, rehearse some more, refine. You get it. Rinse and repeat. First, we cobbled together all of the hand-me-down equipment that we could find or order to mock up the production network. We divided sections of the network between team members. The team leads were responsible for the overall end-to-end view. Next, we meticulously crawled the network. And with the same exactness, we drew it up on a massive whiteboard. And I’m not kidding about the traveling whiteboard. Every morning each team member would test their proposed changes on the lab gear. Then in the afternoon, as a team, we’d march down to the cafeteria to sit around a table and analyze the entire implementation plan. That march wasn’t a sight because seven of us were walking the halls together. Oh no. It was a scene because it took two of us to carry that massive whiteboard. We’d stride to the cafeteria, the whiteboard trailing behind in two sets of hands. In the cafeteria, we propped the whiteboard up on a chair like it was a member of the team.
I once had a co-worker tell me that when I had big thoughts I’d earn a large whiteboard and an appropriate office to accommodate it. I don’t know if big thoughts applied to this situation. Big hopes, big dreams, or perhaps a big dose of insanity might have better described the task at hand.
Each team member would discuss their proposed changes. We reviewed their test results. Then we would reset the configurations to the current state and try out the changes. We’d refine the order so that we could continually squeeze out more and more of the disruption. We had to tweak, twist, and wrestle with each step. Like choreographers, we rehearsed our timing: calling out who was doing what at each step of the plan, leaving enough time for the changes, and accommodating time for a back-out should we need to pull that ripcord. Finally, after weeks of preparation, we had managed to wring out all impact. We were ready for game day.
It was a Saturday evening. The business had given us a few hours for the work. And they had no qualms with constantly reminding us that this had to be done without any noticeable impact to any application. In an environment with approximately 5,000 applications, what could go wrong? We joined the bridge call. We began our pre-change health checks. We had tests running all over the place. Pings from here to there. Monitoring tools polling for data. Synthetic transactions mimicking numerous applications. When we received the green light we started the changes. I’m trying not to sound dramatic, but friend, it was beautiful. Each team member announcing where they were in the process. Each person’s fingers flying across their keyboards. We were synchronized. It was amazing. And, remarkably, it was a 100% success. The meticulous study of the existing network state along with the pinpoint accuracy and coordination of the order of operations yielded us zero packet loss during the change.
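The kind of continuous health check we ran on that bridge call can be sketched in a few lines. This is an illustrative stand-in, not our actual tooling: the endpoints and ports below are placeholder examples, a TCP connect substitutes for ping, and a real synthetic transaction would exercise the application protocol itself.

```python
# A minimal sketch of a change-window watcher: probe a list of service
# endpoints on an interval and tally any losses, so a dropped packet or
# failed transaction shows up while the change is still in flight.
import socket
import time

# Placeholder targets (RFC 5737 documentation addresses):
ENDPOINTS = [("192.0.2.10", 443), ("192.0.2.20", 8080)]

def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """TCP-connect reachability check; True if the endpoint answered."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(duration_s: float = 10.0, interval_s: float = 1.0) -> dict:
    """Probe every endpoint until the deadline; return loss counts."""
    losses = {ep: 0 for ep in ENDPOINTS}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for host, port in ENDPOINTS:
            if not probe((host), port):
                losses[(host, port)] += 1
        time.sleep(interval_s)
    return losses
```

Run a watcher like this before the first config line goes in, keep it running throughout, and let it keep running after the last step; a clean loss count is your evidence that the change was, in fact, hitless.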
If you are going to make a change, you should be able to demonstrate the outcome. You can only do that if you know your inventory, review how things are configured today, measure current performance, record current state, rehearse and refine the changes, and repeat the measurements post-change to verify the results. Verify, verify, verify. Rehearse, Rehearse, Rehearse. With a track record of success, you’ll earn trust. With that trust, you can build up to permission to make changes in the middle of your company’s business day. You’ll earn back those nights, mornings, and weekends not only for the network team but also for your infrastructure peers.