Don’t Upgrade on Nights, Weekends, or Holidays

This should be self-explanatory to anybody who’s been in operations for any length of time. But for you folks who are new to this whole “production” thing, for those of you into the devops concepts but not so much the devops reality, here’s a clue.

Most organizations are not prepared for upgrades. Especially not when third-party tools or software are involved. Oh, you think you are. And maybe you’ve even had the discipline to test this on some other system. Good for you! You’re one step closer to living the dream.

The awful reality, though, is this: your third-party vendor’s support organization is not staffed for weekends, nights, and holidays. Oh, they’ll talk a good game about how they are a true 24*7 operation, blah, blah, blah. Every software support organization I’ve ever worked in has one, maybe two people on call during the weekend. They’re typically surly, underpaid, and were awake for sixteen hours straight last night hand-holding some other customer through some upgrade gone horribly wrong. It will typically take the maximum time the service level objective allows for them to get back to you, and deity help you if you haven’t paid for off-hours support for your software: you’ve never heard doors slam so hard as when you wake someone up in the middle of their night without having paid for the service.

The biggest reason you nerds get into this situation is that you don’t have redundant infrastructure, you don’t test failover routinely, and you think off-hours upgrades will minimize impact on your user base. If you want the A-Team to be there when an upgrade goes badly, you’ll need to do it during your vendor’s normal work hours, and you’ll need that redundant infrastructure so your users can continue to run while you and your folks faff about with the upgrade. You’ll also need to make sure your support contract is paid up long beforehand. It’s also a good idea to submit a ticket telling your vendor what you’re planning, when you’re planning it, how long the window is, what you’ve done to prepare, and what kind of backups you’ve taken.
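If you want a machine to tell you "no" before a human has to, the scheduling rule above can be sketched as a pre-flight check. This is a minimal sketch, not anybody's official tooling: the support hours, the holiday list, and the name `window_is_sane` are all made up for illustration, and you'd swap in whatever your vendor actually publishes (in the vendor's time zone, not yours).

```python
from datetime import date, datetime, time, timedelta

# Hypothetical values: replace with your vendor's published support hours
# (expressed in the vendor's time zone) and the holidays they observe.
VENDOR_OPEN = time(9, 0)
VENDOR_CLOSE = time(17, 0)
VENDOR_HOLIDAYS = {date(2024, 12, 25), date(2025, 1, 1)}


def window_is_sane(start: datetime, duration: timedelta) -> bool:
    """Return True only if the whole window fits inside one vendor workday."""
    end = start + duration
    if start.date() != end.date():
        return False  # runs past midnight: that is a night upgrade
    if start.weekday() >= 5:
        return False  # Saturday (5) or Sunday (6)
    if start.date() in VENDOR_HOLIDAYS:
        return False  # vendor holiday
    # Both ends of the window must fall inside staffed hours.
    return VENDOR_OPEN <= start.time() and end.time() <= VENDOR_CLOSE


if __name__ == "__main__":
    proposed = datetime(2024, 12, 10, 10, 0)  # a Tuesday morning
    print(window_is_sane(proposed, timedelta(hours=4)))
```

The check is deliberately strict: a window that merely starts during business hours but ends after closing time is rejected, because the part where things go wrong is usually the end.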

Oh, you haven’t taken any backups?

Another thing you’d better know is exactly what the vendor’s service levels, compatibility matrix, and scope of support are, and how to escalate your support case.

On the support side, there’s nothing more annoying and hateful than a customer who is upgrading a super-sized cluster, who has taken no backups, who is doing this in the middle of a holiday week, and who, in the face of a fatal error during the upgrade, bulls right ahead and continues the process, then complains via their CEO to your CEO about the horrible service they’ve received. Sure, the software may be crap (support knows this better than anyone anywhere), but blaming the vendor for your fuck-ups will earn you mad disrespect from the tech support team, who have to clean up your mess in the face of inquiries every five minutes by every VP in the company asking them why they haven’t helped Important Customer Asshole.

Be smart. Stop when things get out of hand, ask for help, be patient, test beforehand, always take backups and perform test restores (even more important than backups), assume you’ll have to rebuild the installation from scratch anyway (bonus if you needn’t), and don’t push the panic button unless your vendor is truly not meeting minimum standards. Have an alternate production system in place to take over should things go pear-shaped.
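For the "stop when things get out of hand" part, here is one way the idea might look in a script, assuming your upgrade is a list of steps you can call in order. Everything here is hypothetical (the `run_upgrade` name, the step names, the fake failure); the only point is that the first error halts the run instead of being bulldozed past.

```python
from typing import Callable, List, Optional, Tuple

# A step is a human-readable name plus a callable that raises on failure.
Step = Tuple[str, Callable[[], None]]


def run_upgrade(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed, plus an error
    description (or None if every step succeeded).
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Do not continue. Hand this message to support, then wait.
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    def stop_services() -> None:
        pass  # placeholder for real work

    def migrate_schema() -> None:
        raise RuntimeError("schema migration failed")  # simulated fatal error

    def start_services() -> None:
        pass  # never reached in this example

    done, error = run_upgrade([
        ("stop services", stop_services),
        ("migrate schema", migrate_schema),
        ("start services", start_services),
    ])
    print("completed:", done)
    print("stopped on:", error)
```

The list of completed steps is what goes in the support ticket: it tells the person on the other end exactly how far you got before you stopped.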
