With Halloween just around the corner, let's chat about something scary.... service interruptions. Boo!
Have you ever dealt with a disastrous IT service interruption at work? What did you do? Or if you haven't dealt with one... what is your secret?
Share your tips and tricks in the comments. (No proprietary or confidential info plz - that would be too scary!) The comment with the most "likes" in ~48 hours will get a treat!
PS: Want to learn more about disaster recovery? Here's a handy roundup of resources (and a few free labs).
Came here for the ghost stories.
(WARNING: this will likely be quite lengthy, so the TL;DR with lessons learned/tips is at the very bottom)
Lack of proper documentation and changelogs is just about the worst thing you can do if you want to tempt fate. I had the privilege of starting my cloud/devops career under the tutelage of an awesome team that instilled a lot of good habits and best practices in me, and I've been carrying them on ever since. I've seen a lot of companies where techies just want to "do cool stuff" but not actually write about it or document it, and as a result even very minor issues can turn into something major because people keep getting pulled into a call because "I think he/she might know". Now onto my story...
At my previous company, the director who poached me was someone I had worked with before, so he knew I enjoyed a good challenge. They were a small (< 70 people; the IT team was maybe 7 people including myself) AWS shop, and he had inherited a Kubernetes setup that was poorly set up and undocumented. I have to note here that a managed k8s service such as AWS EKS was not an option, because this company operated in a regulated industry where all of its data needed to stay north of the border (in Canada, if that wasn't clear), and EKS only became GA in Canada in early-to-mid 2020, I believe.

So, the k8s cluster in question was fully self-managed and created with a series of shell scripts and other miscellaneous open-source tools. The few people who created this monstrosity did it mainly so they could learn k8s, as it was (and still is) a highly valued skillset, and once they all got what they wanted, they took that experience and left for the next highest bidder (without leaving any documentation on which scripts did what, etc.). I had k8s experience, so I was brought in to try to decipher, document, and improve on what existed. I deciphered it, but without going into a large wall of technical text as to why it wasn't feasible to fix what existed, I'll just say it was analogous to a house of cards built from wet playing cards. Furthermore, there had already been a few minor issues, and as the only person who really knew k8s, I was called upon to fix/support it even though that was not part of my role -- the remaining team members were all very junior and didn't have the k8s knowledge or experience.
I knew the entire setup needed an overhaul, but while I could design a better (self-managed) k8s setup, I knew it would still be too daunting for a team that had k8s PTSD. I needed something simpler and preferably a managed service. I considered AWS ECS, but I really wasn't a fan of its (lack of) features. I had always been a fan of HashiCorp's products and have extensive experience with some of their tools, like Terraform and Vault. I knew they had a little-known product called Nomad, which is quite good but dwarfed by the popularity of k8s. I decided to take a more serious look at Nomad, and I had a very small-scale setup going when disaster struck!
The non-prod cluster, which our clients used for training & testing purposes, had suddenly come to a halt, and I wish I could tell you that I was able to fix it, but I couldn't; this time it was beyond what my expertise and my Google-fu could solve. While this was an outage and did impact our clients to some degree, it wasn't prod -- so it could've been worse. I also have to note here that the non-prod cluster was not like the prod cluster(s) at all; prod was likely just an iteration of non-prod. I had to act fast. I liked what I saw in Nomad, but I had a very basic setup and needed to do more testing and scaling (and then more testing). I spent ~2 weeks (read as: no weekends) on it and finally got a non-prod Nomad cluster out to replace the fallen k8s cluster, refactored all the k8s deployments into Nomad jobspecs, and restored service. The initial cluster was built manually so that I could make a lot of tweaks along the way. Then it was time to document the process and write some Terraform code to be able to produce and reproduce the setup easily and reliably. I also built some automation (and documented that too), and by the time all this was done, a couple of months had already passed and everything was running smoothly. Now it was time to come up with a game plan to swap out k8s for Nomad in prod -- which we did successfully while also implementing a bunch of network and security enhancements. All in all, it was 6 months of hard work in the making.
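To give a rough idea of what "refactoring the k8s deployments into Nomad jobspecs" means, here's a minimal sketch of a service-type jobspec. Every name in it (job, image, port, datacenter) is a placeholder I'm making up for illustration, not anything from that environment:

```hcl
# Hypothetical Nomad jobspec sketch -- names and values are placeholders.
job "example-api" {
  datacenters = ["dc1"]
  type        = "service"

  group "api" {
    count = 2  # roughly equivalent to a Deployment's replica count

    network {
      port "http" {
        to = 8080  # container port the service listens on
      }
    }

    task "api" {
      driver = "docker"

      config {
        image = "example/api:1.2.3"
        ports = ["http"]
      }

      resources {
        cpu    = 500  # MHz
        memory = 256  # MB
      }
    }
  }
}
```

Conceptually, each k8s Deployment maps fairly naturally onto a job/group/task structure like this, which is part of what made the refactor tractable in a couple of weeks.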
TL;DR - Inherited a poorly done self-managed k8s setup, started looking at other potential options. Non-prod k8s cluster failing in an unrepairable manner led to it being replaced by HashiCorp's Nomad 2 weeks later. Prod was replaced 6 months later (and now there was parity between non-prod and prod). Added automation and documentation.
I will say that I was extremely fortunate with the timing of the events:
- I recognized that the solution they had was a little over-engineered and undocumented (two words you never want to see in the same sentence) and not necessarily the best fit for such a small, inexperienced IT staff. Having already begun testing Nomad when the non-prod k8s cluster fell really helped, because in an alternate timeline where I hadn't been proactive in searching for an alternative, I may have spent a lot of time trying to fix that broken k8s cluster instead, and that would've easily been a very, very rough time.
- I was lucky that it was the non-prod k8s cluster that died and not prod because it was still another 2 weeks before I was confident in unleashing the Nomad setup. Had it been prod that had fallen, I probably would not have been afforded the same luxury of time.
- Don't fall into the trap of using what's popular or what would benefit your career. I'm not saying there's anything wrong with Kubernetes -- only that Kubernetes wasn't the best solution for that company at that time. Even now, as a consultant, I sometimes see clients push towards using Istio because all the cool kids are using service meshes these days -_-" (that should never be the reason to use something...). Also, had it been at a time when a managed Kubernetes service like EKS was available in Canada, that would probably have been the path of least resistance in terms of fixing the issues that plagued the self-managed one.
- Test, automate, then test some more. I had tested the Terraform code I wrote so many times, and had other people build and tear down the cluster to make sure they knew what to expect from a process and from an output/messages perspective (a rough sketch of what that Terraform looked like follows this list). This was all done before we made the prod cutover from k8s to Nomad.
- Document. Then document some more.
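For anyone curious what that "build and tear down the cluster over and over" loop rested on, here's a purely illustrative Terraform skeleton for the Nomad server nodes. The variables, AMI, instance type, and bootstrap are all my assumptions for the sake of the example, not the actual code from that project:

```hcl
# Illustrative sketch only -- every name and value here is an assumption.
variable "server_count" {
  description = "Number of Nomad server nodes (odd number for raft quorum)"
  type        = number
  default     = 3
}

variable "nomad_ami" {
  description = "AMI assumed to be pre-baked with the Nomad binary and a systemd unit"
  type        = string
}

variable "subnet_ids" {
  description = "Private subnets to spread the servers across"
  type        = list(string)
}

resource "aws_instance" "nomad_server" {
  count         = var.server_count
  ami           = var.nomad_ami
  instance_type = "t3.medium"
  subnet_id     = element(var.subnet_ids, count.index)

  user_data = <<-EOT
    #!/bin/bash
    # Cluster bootstrap/join logic omitted for brevity.
    systemctl enable --now nomad
  EOT

  tags = {
    Name = "nomad-server-${count.index}"
    Role = "nomad-server"
  }
}

output "nomad_server_private_ips" {
  value = aws_instance.nomad_server[*].private_ip
}
```

The point isn't the specific resources; it's that `terraform apply` and `terraform destroy` produce the same cluster every time, so anyone on the team can rebuild it without tribal knowledge.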
Don't know, this is happening for the first time here.
Don't worry, @emily927 is tracking all of this and identifying your fake IDs.
I won a challenge last month too, check my profile.
U are liking my post with fake IDs? Great, this won't lead to anything, hahahaha.
Good try
This trick can't do anything, buddy. Better luck next time.
Stop fighting, she will handle it.
some major Git push-pull havoc.
I'll leave it at that.
My Google Meet crashed while I was giving an online placement exam where we were being proctored on video.
Two cloud services for primary and secondary. When the Akamai CDN went down... everything went down.
A service had an endpoint to list all the claimed properties. This exposed the complete object in JSON, and it was possible to find a profile just by having the UUID of the profile. Oops, that was a leak of personally identifiable information.
Scariest words for programmers:
"SYNTAX ERROR"
Do you agree?
I will drink the blood of the service provider if there is a service interruption. I am a zombie. Ha... ha... ha.
Networking in Qwiklabs this Halloween, with zombies.
My Wi-Fi crashed just 5 minutes before my final exam.
What did I do? A simple restart solves 90 percent of IT service disruption problems, hahahaha.
PS: Please don't use this technique, as it may lead to serious problems.
I have been an active participant in this community for many months and have won these before. I think u are doing this and trying to manipulate everyone. Stop doing this, @akshat.
Don't worry, Google won't ban you without a reason; they cross-verify user activity, so if you aren't doing any bad practice, stay calm.
One person with so many IDs.
#great_work
U are already disqualified. Don't worry.
This is scary to most people. (Please ignore if you are scared of such stories.)
One day, on Halloween, we had an online exam. The time given was 1 hour, but I had only 5 minutes left to turn my photos into a PDF and submit it. My computer somehow knew I was in a hurry and that I was tense: it suddenly slowed down and the internet started to lose its strength. I started thinking about how to solve this big mystery with only 3-4 minutes left. I decided to relax and calm myself, and as soon as I did, my computer started running properly again, just as I expected. So I submitted in time.
So the truth is: never let your computer know your weakness.
Have you ever felt this kind of scared when you were in a rush?
Yes, u are right, he is using fake IDs to like his own post under different names.
This is not a game, but @emily927 can easily verify who is using a fake ID.
No need to worry
Okay
No problem, all the staff are active here; they will automatically ban anyone if they find anything fishy.
Last semester, during my semester examination, I wasn't able to upload my answer sheets. Just two minutes were left to decide my result. When I finally realised there was some issue, I asked one of my friends to upload the answer sheets through my ID. And that's how I was saved.
I was working on an important project and didn't notice Cloud Shell was in ephemeral mode. A power failure caused Cloud Shell to lose connectivity, and I lost the project I had worked on.
Testing websites in Internet Explorer 8 and below is the scariest thing I have encountered. Like if anybody else has encountered this!!!
My brother just lost a #LearnToEarn Challenge water bottle by just 1 lab.
After a month, he's like, "Ohh! I forgot just one lab."
Once I needed to take an interview but the printer wasn't working; somehow I managed to get the printout from a local cafe.
Haha, this is an IT service disruption that happened to me:
I was taking my TCS exam and my Wi-Fi kept getting disabled again and again, but I somehow managed using my phone's USB tethering.